Summary

The main goal of the project is to design, in RSFQ technology, a 20-GHz, 8-bit-wide, energy-efficient processor datapath consisting of an 8-bit ALU, an 8x8-bit Register File, and an Instruction Decoder. Several project modifications resulted in additional tasks: the development of energy-efficient SFQ technology with zero static power dissipation, the development of an energy-efficient interface based on low-input-voltage polarization-modulating VCSELs, and the development of superconducting ferromagnetic Random Access Memory.

During  the  project,  we  have  achieved  the  following  results: 

1) The VHDL-level design of the 8-bit datapath was performed in collaboration with the SUNY at Stony Brook group. Based on the VHDL simulation results, we designed the physical-level layout for HYPRES’s standard fabrication process.

2) We have designed an 8-bit Arithmetic-Logic Unit (ALU) in RSFQ technology. The design is based on a Kogge-Stone CLA adder and employs a wave-pipelined architecture. The ALU, the major and critical part of the proposed processor datapath, was fabricated and fully tested for functionality and operational margins. At low speed, the measured critical margin for bias current was +/- 7%. Using an on-chip test-bed based on controlled SFQ relays, we have fully tested the ALU at a 20-GHz clock frequency.

3) We have designed, fabricated, and tested an 8-byte Register File comprising a matrix of two banks of four 8-bit registers integrated with a control logic block. The complete 8x8-bit Register File was successfully demonstrated. Besides the data port operation, all 64 memory cells of the register file were tested individually at the nominal bias current. The operational margins for dc bias current varied from -14% / +25% to -1% / +2%.

4) In an additional effort, we have developed a novel resistor-free biasing scheme for RSFQ with zero static and minimal dynamic power dissipation. We called it energy-efficient RSFQ, or ERSFQ. It is fully compatible on a cell level with resistive RSFQ logic, allowing us to utilize the RSFQ cell library with minor modifications. Using this approach, we have designed and successfully demonstrated at high speed (up to 60 GHz) a number of circuits, including a static frequency divider by 220, a detector digital readout (ADC), and two types of 8-bit parallel adder. The main achievement in energy dissipation reduction was the demonstration of an ERSFQ 8-bit parallel adder dissipating 160 aJ per operation. None of the investigated ERSFQ circuits showed performance degradation compared to their RSFQ counterparts, and in some cases they even outperformed them.

5) The 8-bit ALU was also designed in the new ERSFQ technology. The new ALU architecture is based on a wave-pipelined ripple-carry adder featuring high throughput (simulated 44 GHz for the 4.5 kA/cm² process), asynchronous carry propagation, and small latency. At the same time, it operates with a high data skew factor that must be matched by the register file.

6) The other additional goal of this multi-phase project is to develop and demonstrate an energy-efficient output data interface between cryogenic 4 K superconducting modules and room-temperature semiconductor systems using a combination of energy-efficient on-chip drivers, low-loss and low-dispersion cables, and polarization-modulating vertical-cavity surface-emitting lasers (PM VCSELs). During this project period, we completed the fabrication and testing of the second
generation  designs  of  PM  VCSELs  with  modifications  introduced  during  the  previous  project 
period.  This  new  design  is  based  on  a  “half-VCSEL”  structure  with  dielectric  top  distributed 
Bragg  reflector  (DBR).  We  fabricated  the  first  iteration  of  VCSEL  devices  with  dielectric  DBR 
and  demonstrated  improvement  in  performance  although  with  lesser  polarization  control.  The 
second  fabrication  iteration  to  address  polarization  control  is  80%  complete.  Preliminary  testing 
of  these  devices  before  deposition  of  the  top  dielectric  DBR  mirror  shows  diode  current-voltage 
characteristics, and clear electroluminescence was observed. In addition, we have completed, optimized, and employed a setup for cryogenic VCSEL testing over a wide temperature range. It is based on a Sumitomo two-stage cryocooler with accurate temperature control of the first stage. The measurement process and data collection are performed using a LabVIEW program developed for these measurements. We tested a set of VCSEL samples produced by the Univ. of Illinois team to verify and calibrate our cryogenic setup. We also fabricated and successfully tested a new on-chip energy-efficient driver based on ERSFQ logic. These drivers are based on dc/SFQ converters redesigned for ERSFQ logic. The bias of the driver output stage was implemented via the output data line from the PM VCSEL.

7) Another added task was the development of approaches to maximizing the energy efficiency of SFQ digital circuits. We performed the first experimental demonstration of the recently proposed energy-efficient single flux quantum logic with zero static power dissipation, eSFQ. We also demonstrated that the introduction of passive phase shifters allows the reduction of dynamic power dissipation by about 20%. Two types of demonstration eSFQ circuits, shift registers and demultiplexers (deserializers), were implemented using the standard HYPRES 4.5 kA/cm² fabrication process.

8) The goal of this additional task is to develop 4 K Superconducting Ferromagnetic MRAM circuits compatible with energy-efficient Josephson junction digital SFQ circuits. A scalable, energy-efficient memory element based on Magnetic Josephson junctions (MJJs) was developed and demonstrated. For the SIsFS MJJ, we demonstrated memory properties with two memory states having different critical current values and a high IcRn comparable to that of conventional SIS Josephson junctions. We have also demonstrated a superconducting ferromagnetic transistor (SFT), a three-terminal device with good input/output isolation for integration with an MJJ-based memory cell, capable of performing the memory cell selector function in random access memory arrays.
I. Demonstration of RSFQ processor datapath

This  project  is  the  first  demonstration  of  the  scalable  processor  microarchitecture  suitable  for 
the  implementation  of  wide  data  path  digital  circuits. 

High-performance computing (HPC) is one of the fields in which the high speed and low power dissipation of superconducting digital circuits have been the motivation for several Josephson microprocessor development efforts, following the path established by IBM’s famous Josephson project¹. Unfortunately, the requirement of global timing for the superconductor ac-powered latching logic, as well as the high power dissipation of the voltage-generating elements of this logic family, along with some other technical obstacles, made the implementation of high-speed processors impossible at tens-of-gigahertz clock rates.

With RSFQ logic²,³, the development of a high-performance superconductor processor became feasible. Non-latching, dc-powered RSFQ logic featuring local timing and self-timing enabled the design of processing modules operating at tens of gigahertz with very low power dissipation. There were two projects for developing a superconductor computer⁴,⁵,⁶. A major part of these projects was developing an RSFQ microprocessor operating at minimum power while clocking at very high rates.

Until recently, only two 8-bit prototypes of such a microprocessor, FLUX⁵ and CORE⁶, were developed. Only CORE was successfully demonstrated, and neither of them used true 8-bit-wide datapath processing in its pipeline. The FLUX microprocessor followed a processing-in-registers
microarchitecture  that  allowed  eight  ALU  operations  to  proceed  simultaneously  in  its  datapath, 
producing  up  to  eight  bits  per  cycle  (albeit  belonging  to  different  operations).  CORE  used  a 
simple  bit-serial  pipeline  generating  one  bit  of  result  per  cycle. 

The  FLUX  microarchitecture  was  able  to  hide  the  latency  of  its  eight  bit-serial  processing 
pipelines  by  allowing  any  instruction  to  start  its  execution  as  soon  as  the  least  significant  bits  of 
its  input  operands  are  calculated.  In  the  bit-serial  CORE  processor,  an  instruction  needs  to  wait 
until  all  eight  bits  of  its  inputs  are  calculated  sequentially. 

Although  these  approaches  allowed  the  design  of  low-complexity  execution  pipelines  in  these 
first  microprocessor  prototypes,  they  are  not  scalable  or  applicable  to  future  32-/64-bit  RSFQ 
processors.  The  development  of  a  wide-datapath  microprocessor  is  crucial  for  superconductor- 
based  HPC. 


1 W. Anacker, “Josephson Computer Technology: An IBM Research Project,” IBM Journal of Research and Development, vol. 24, no. 2, pp. 107-112, Mar. 1980.

2 K. Likharev and V. Semenov, “RSFQ logic/memory family: A new Josephson-junction technology for sub-terahertz clock-frequency digital systems,” IEEE Trans. Appl. Supercond., vol. 1, pp. 3-28, Mar. 1991.

3 O. A. Mukhanov, S. V. Rylov, V. K. Semenov, and S. V. Vyshenskii, “RSFQ logic arithmetic,” IEEE Trans. Magn., vol. MAG-25, no. 2, pp. 857-860, Mar. 1989.

4 T. Sterling, “A design analysis of a hybrid technology multithreaded architecture for petaflops scale computation,” in Proc. of International Conference on Supercomputing, p. 386-296, 1999.

5 P. Bunyk, M. Leung, J. Spargo, M. Dorojevets, “FLUX-1 RSFQ microprocessor,” IEEE Trans. Appl. Supercond., vol. 13(1), p. 433, 2003.

6 A. Fujimaki, M. Tanaka, T. Yamada, Y. Yamanashi, H. Park, N. Yoshikawa, “Bit-serial single flux quantum microprocessor CORE,” IEICE Trans. Electron., vol. E91-C, pp. 342-349, Mar. 2008.
1. VHDL-level 8-bit datapath design and verification

The VHDL design of the datapath was performed by the group of Prof. M. Dorojevets of Stony Brook University (Stony Brook, NY) using the cell library provided by HYPRES.

The 8-bit datapath consists of three major blocks:

Instruction Control: Instruction issue & instruction decode units (IIU/IDU)

Data Storage: Register file (RF)

Processing: Arithmetic-logic unit (ALU)

The Instruction Issue Unit (IIU) and Instruction Decode Unit (IDU) are responsible for:

resolving data hazards (data dependencies between instructions),
fetching instructions from the instruction buffer,
decoding them into control signals for the datapath.

The IIU consists of a 5-instruction buffer, with each instruction containing 20 bits. The buffer is essentially a group of shift registers with a parallel readout.

The IIU also contains an Issue Control Block, which resolves data hazards by delaying instruction issue until both source operands for the instruction are calculated. The issuing of instructions is based on a dataflow mechanism. Each instruction has a field that specifies the first instruction in the instruction buffer that needs the instruction’s result.

The 8-bit multi-port register file stores data and delivers them to the ALU for further computation. The register file can perform two simultaneous read operations and one write operation from the ALU, provided that separate source registers are used for each operation.

The  8-bit  register  file  has  eight  8-bit  registers.  Two  of  the  registers  are  used  as  input/output  left  and  right 
ports. 

The register file is divided into two banks, providing left and right operands for ALU operations, respectively. The left and right register bits are interleaved, so that the most significant bit from the left bank is placed next to the most significant bit from the right bank, and so on. This placement is dictated by the ALU layout to provide the smallest data skew on the data inputs to the ALU.

Six  out  of  the  eight  registers  provide  non-destructive  data  readout.  The  two  I/O  ports  can  be  loaded  bit- 
serially  from  the  outside  of  the  chip  and  read  as  source  data  to  ALU  (with  destructive  read-out).  When 
written  with  the  results  of  ALU  operations,  the  I/O  ports  can  be  bit-serially  read  out  to  the  outside  of  the 
chip. 

The  data  stored  in  each  of  the  eight  8-bit  registers  are  read  using  a  “one  hot-encoded”  read  vector  signal. 
The  output  of  each  register  bit  is  merged  with  corresponding  outputs  of  other  registers  in  the  same  bank  as 
well  as  an  immediate  value  from  an  instruction. 
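The one-hot read and merge described above can be sketched in a few lines of Python. This is only a behavioral illustration under stated assumptions: the function name `read_bank`, the list representation of a bank, and the modeling of the merge as a bitwise OR (with the immediate held at zero whenever a register is selected) are hypothetical and are not taken from the VHDL design.

```python
def read_bank(registers, read_vector, immediate=0):
    """Behavioral sketch of a one-hot read from one register bank.

    registers   -- list of 8-bit register values in the bank (assumed layout)
    read_vector -- one-hot list of select bits, one per register
    immediate   -- immediate value from the instruction, merged into the output
    """
    if sum(read_vector) > 1:
        raise ValueError("read vector must be one-hot (at most one bit set)")
    out = immediate & 0xFF
    for value, select in zip(registers, read_vector):
        if select:
            # Merging of outputs is modeled here as a bitwise OR (assumption).
            out |= value & 0xFF
    return out


bank = [0x11, 0x22, 0x33, 0x44]
print(hex(read_bank(bank, [0, 0, 1, 0])))                  # selected register
print(hex(read_bank(bank, [0, 0, 0, 0], immediate=0x5A)))  # immediate only
```

Because only one select bit is active at a time, the merged output carries either the selected register or the immediate, which is why a simple merge suffices in this model.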

Besides the two operands, the ALU opcode and Ready signals are received from the Instruction decode stage and broadcast to the RF output, one copy for each ALU bit slice.

The register file contains an asynchronous write-back destination register address FIFO, which is used to deliver and write data to a destination register when the data and their ready signal from the ALU arrive at the write input of the register file. Four register destination addresses received from the Instruction decode stage can be buffered inside the write-back FIFO, thus allowing up to four outstanding (on-going) ALU operations to be in execution within the datapath.
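A minimal behavioral sketch of the write-back destination FIFO may help make the "four outstanding operations" limit concrete. The class and method names below (`WriteBackFifo`, `issue`, `write_back`) are hypothetical, and the model ignores all timing; it only shows that destination addresses are consumed in issue order when ALU results arrive.

```python
from collections import deque


class WriteBackFifo:
    """Toy model of a destination-address FIFO (depth 4 per the text)."""

    def __init__(self, depth=4):
        self.depth = depth
        self._dest = deque()

    def issue(self, dest_reg):
        """Buffer a destination address; return False if the FIFO is full."""
        if len(self._dest) >= self.depth:
            return False  # a fifth outstanding operation must wait
        self._dest.append(dest_reg)
        return True

    def write_back(self, registers, alu_result):
        """Write an arriving ALU result to the oldest buffered destination."""
        dest_reg = self._dest.popleft()
        registers[dest_reg] = alu_result & 0xFF
        return dest_reg


regs = [0] * 8
fifo = WriteBackFifo()
for dest in (2, 5, 1, 7):
    fifo.issue(dest)
print(fifo.issue(3))               # FIFO full, issue refused
print(fifo.write_back(regs, 0x2A)) # oldest destination is written first
```

In this sketch a refused `issue` stands in for the instruction issue stall; how the real IIU handles a full FIFO is not described in the text.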

The ALU implements wave-pipelining techniques that allow it to use all of its stages effectively without the need for registers and clocks in between. The major goal is to find efficient design techniques capable of improving the design’s performance in terms of latency and rate while tolerating large delay fluctuations in wide datapath circuits. Using the Kogge-Stone adder as a starting point, an ALU has been developed to perform a set of logic operations such as AND, NOR, XOR, and XNOR. Additionally, the ALU has four types of ADD operations, namely normal ADD, ADD with inverted A, ADD with inverted B, and ADD with inverted A and B.
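For readers unfamiliar with the Kogge-Stone scheme, the following is a textbook parallel-prefix formulation in Python. It is a sketch of the general algorithm only; it does not reproduce the block structure, wave pipelining, or timing of the RSFQ ALU, and the function name `kogge_stone_add` is an illustrative choice.

```python
def kogge_stone_add(a, b, width=8):
    """Add two unsigned integers modulo 2**width using Kogge-Stone prefixes."""
    # Per-bit generate and propagate signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    half_sum = p[:]  # the sum bit before the carry is applied

    # log2(width) prefix stages; each stage doubles the span combined.
    dist = 1
    while dist < width:
        new_g, new_p = g[:], p[:]
        for i in range(dist, width):
            new_g[i] = g[i] | (p[i] & g[i - dist])
            new_p[i] = p[i] & p[i - dist]
        g, p = new_g, new_p
        dist *= 2

    # After the last stage, g[i] is the carry out of bit i.
    result = 0
    for i in range(width):
        carry_in = g[i - 1] if i > 0 else 0
        result |= (half_sum[i] ^ carry_in) << i
    return result


print(kogge_stone_add(127, 1))
print(kogge_stone_add(255, 1))
```

For an 8-bit word the loop runs three prefix stages, so the carry into every bit is available after a logarithmic number of stages rather than after a bit-by-bit ripple.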

The ALU has a latency of approximately 392 ps (about eight 20-GHz datapath chip cycles) and an overall complexity of 7,319 JJs. It is capable of running at a maximum rate of ~29 GHz.

The simulation results of the datapath confirm full functionality of all units running together up to ~23.3 GHz (43-ps cycle time). Inspection of the simulation violations log shows that all failed waves are due to timing violations in the write-back destination FIFO within the register file, making it the rate-limiting component of the datapath.

Depending on the destination register chosen for an instruction, the full datapath latency can range from ~786 ps up to ~878 ps. Overall, the entire 8-bit datapath has a complexity of ~16K JJs.
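The relation between the quoted latencies, cycle times, and clock rates is simple arithmetic, sketched below. The helper names `latency_in_cycles` and `rate_ghz` are illustrative; the input numbers are the ones quoted in the text.

```python
def latency_in_cycles(latency_ps, clock_ghz):
    """Number of clock cycles spanned by a latency at a given clock rate."""
    cycle_ps = 1000.0 / clock_ghz
    return latency_ps / cycle_ps


def rate_ghz(cycle_ps):
    """Clock rate in GHz corresponding to a cycle time in picoseconds."""
    return 1000.0 / cycle_ps


print(round(latency_in_cycles(392, 20), 2))   # ALU latency in 20-GHz cycles
print(round(rate_ghz(43), 1))                 # rate for a 43-ps cycle time
print(round(latency_in_cycles(786, 20), 2),   # datapath latency range
      round(latency_in_cycles(878, 20), 2))
```

At 20 GHz the cycle time is 50 ps, so the 392-ps ALU latency is just under eight cycles, and the full datapath latency corresponds to roughly 16 to 18 cycles.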

TABLE II. Datapath Hardware Consumption

Datapath Block    Total JJs    % JJs
IIU/IDU           4,467        27.97%
RF                4,390        27.48%
ALU               7,116        44.55%
Total             15,973       100.00%
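As a quick consistency check, the percentage column of Table II can be recomputed from the junction counts. The dictionary name `JJ_COUNTS` and the function `jj_shares` are illustrative.

```python
# Junction counts per datapath block, as listed in Table II.
JJ_COUNTS = {"IIU/IDU": 4467, "RF": 4390, "ALU": 7116}


def jj_shares(counts):
    """Return each block's share of the total JJ count, in percent."""
    total = sum(counts.values())
    return {block: round(100.0 * n / total, 2) for block, n in counts.items()}


print(sum(JJ_COUNTS.values()))
print(jj_shares(JJ_COUNTS))
```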
2. RSFQ 8-bit ALU design and demonstration


Arithmetic  Logic  Unit  (ALU)  Chip 

Our ALU is based on a Kogge-Stone type adder. Fig. 2.1.1 shows the block diagram of the ALU. It consists of four types of blocks, INIT, ROUT1, ROUT2, and SUM, connected with passive transmission lines (PTLs). The most important part of the ALU is the INIT block.

Fig. 2.1.1. Block-diagram of the 8-bit RSFQ ALU.

For testing the ALU, we have fabricated two chips: one for the low-speed functionality test (Fig. 2.1.2) and the other containing the embedded high-speed test-bed circuitry. Each chip has about 8,000 Josephson junctions (the active elements of RSFQ technology).
Fig. 2.1.2. Microphotograph of a chip with the 8-bit ALU. (Labeled regions: clock and data distribution; layer of INIT blocks; layers of ROUT1 and ROUT2 blocks; layer of SUM blocks.)

The ALU has 12 instructions in its instruction set. They are coded with a 4-bit instruction code. This is a complete instruction set that allows programming any logic or arithmetic operation.


Operation              Op3  Op2  Op1  Op0
ADD                    1    1    0    0
ADD-Invert A           1    1    1    0
ADD-Invert B           1    1    0    1
ADD-Invert A and B     1    1    1    1
AND                    1    0    0    0
NOR                    1    0    1    1
Set all bits to “1”    0    0    1    1
AND-Invert A           1    0    1    0
AND-Invert B           1    0    0    1
XOR                    0    1    0    0
XNOR                   0    1    0    1
NOP                    0    0    0    0

Table 1. The instruction set.

In the instruction set of the processor (Table 1), addition (ADD) is the most complex and hardware-consuming arithmetic operation. This led us to base our ALU on a parallel adder design.
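The instruction set of Table 1 can be captured in a small behavioral model, which is convenient for predicting expected outputs of functionality tests. This is a sketch under assumptions: "Invert" is modeled as a bitwise inversion of the operand before the operation, no carry-in is applied, results are taken modulo 256, and NOP is modeled as producing no result (`None`). The names `ALU_OPS` and `alu` are illustrative.

```python
MASK = 0xFF  # 8-bit datapath


def inv(x):
    """Bitwise inversion within 8 bits."""
    return ~x & MASK


# Opcode bits are (Op3, Op2, Op1, Op0), as in Table 1.
ALU_OPS = {
    0b1100: ("ADD", lambda a, b: a + b),
    0b1110: ("ADD-Invert A", lambda a, b: inv(a) + b),
    0b1101: ("ADD-Invert B", lambda a, b: a + inv(b)),
    0b1111: ("ADD-Invert A and B", lambda a, b: inv(a) + inv(b)),
    0b1000: ("AND", lambda a, b: a & b),
    0b1011: ("NOR", lambda a, b: inv(a | b)),
    0b0011: ("Set all bits to 1", lambda a, b: MASK),
    0b1010: ("AND-Invert A", lambda a, b: inv(a) & b),
    0b1001: ("AND-Invert B", lambda a, b: a & inv(b)),
    0b0100: ("XOR", lambda a, b: a ^ b),
    0b0101: ("XNOR", lambda a, b: inv(a ^ b)),
}


def alu(opcode, a, b):
    """Return the 8-bit result for an opcode, or None for NOP (assumption)."""
    if opcode == 0b0000:
        return None
    _name, operation = ALU_OPS[opcode]
    return operation(a & MASK, b & MASK) & MASK


for code in (0b1100, 0b1000, 0b0100):
    print(ALU_OPS[code][0], alu(code, 101, 45))
```

In this encoding Op1 and Op0 appear to act as "invert A" and "invert B" flags for the ADD and AND groups; NOR then coincides with AND of both inverted operands, which is consistent with De Morgan's law.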
ALU Functionality Test

Fig. 2.1.3. ALU functionality test for operation “ADD” {A+B}.

Fig. 2.1.3 shows the correct operation of the ALU adding 8-bit numbers (A+B). The bottom trace is the Ready signal, applied at every instruction execution. The 8-bit operand A, operand B, and the 8-bit output are shown in ascending order. The result of the addition comes out modulo 256.

In order to provide subtraction, the ALU can invert one or both operands and add them in a single instruction. The most complex operation in the instruction set is “ADD-Invert A and B”, which is essentially equivalent to the arithmetic operation (-2-A-B). The test result of the ALU performing this operation is shown in Fig. 2.1.4. Here, we have preserved the same order of traces and the same operand pattern as in Fig. 2.1.3.

Fig. 2.1.4. ALU functionality test for operation “ADD-Invert A and B” (both operands A and B are inverted before adding) {~A + ~B}.


The  functionally  simplest  operations  are  the  so-called  bit-logic  operations,  such  as  AND, 
XOR,  NOR,  etc.  They  do  not  produce  a  “carry”  bit  propagating  across  the  ALU.  The  results  of 
logic  operations  performed  in  the  INIT  blocks  of  the  ALU  (Fig.  2.1.1)  go  directly  to  the  output. 
This  property  of  the  bit-logic  operations  simplifies  the  test  pattern  necessary  to  perform  a 
complete  test  of  the  ALU. 

The low-speed functionality test results for six bit-logic operations are shown: AND in Fig. 2.1.5; (Inv A) AND B in Fig. 2.1.6; A AND (Inv B) in Fig. 2.1.7; NOR in Fig. 2.1.8; XOR in Fig. 2.1.9; and XNOR in Fig. 2.1.10. For consistency, we placed the traces in the same order as in Fig. 2.1.3.

The critical (minimal) operating margins on dc bias current were +/- 7%. This is quite sufficient for the ALU to function with an acceptable bit-error rate. We plan to test more chips in order to obtain yield and statistics on operational margins.

Fig. 2.1.6. ALU functionality test for operation “AND-Invert A” {(~A)&B}.

[Figure panel: ALU functionality test, A & ~B]

Fig. 2.1.8. ALU functionality test for operation “NOR” {~(A|B)}.

[Figure panel: ALU functionality test, A^B]

Fig. 2.1.10. ALU functionality test for operation “XNOR” {~(A^B)}. (Traces, top to bottom: Out7 to Out0, B7 to B0, A7 to A0, Ready.)

ALU High-Speed Test

The 8-bit ALU has as many as 21 high-speed input terminals and 8 high-speed output terminals (Fig. 2.1.11). A 20-GHz 20-channel pattern generator capable of producing highly synchronous data streams is not available. It is common practice to test high-speed digital circuits using on-chip test [...] of our ALU circuit [...]


Inputs:
8-bit operand A
8-bit operand B
4-bit instruction code
1-bit Clock (ready)

Output:
8-bit output

Fig. 2.1.11. The 8-bit ALU I/O count.

The test-bed design approach is based on generating operands and control signals from the clock (ready) by means of SFQ relays controlled by dc or low-frequency signals (Fig. 2.1.12). A relay lets an SFQ pulse [...] current is off.

Fig. 2.1.12. Schematics and symbol of an SFQ relay. (Terminals: SFQ In, Control current, SFQ Out.)

We designed a chip for the high-speed test of the ALU. In this chip (Fig. 2.1.13), the high-speed clock (ready) signal is applied to all input data terminals via the SFQ relays described above. By [...]

Fig. 2.1.13. Block-diagram of the ALU high-speed test chip. (Labeled blocks: high-speed test bed; clock; toggling SFQ-to-DC voltage converters.)

For the output amplifiers, we used the same toggling-type SFQ-to-DC converters as for the low-speed functionality test. A toggling SFQ-to-DC converter switches its dc-voltage state between 0 mV and 0.5 mV every time an SFQ pulse appears. On a low-speed oscilloscope, a 20-GHz output (steady “1”) is shown as a single line at 0.25 mV (the average voltage between the two SFQ/DC states toggling at 20 GHz), while the absence of an output signal (steady “0”) appears as a voltage level of either 0.0 mV or 0.5 mV (see the inset in Fig. 2.1.13).
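The way a slow oscilloscope displays the output of a toggling converter can be illustrated with a short simulation. This is an idealized sketch: the oscilloscope is modeled as a plain average over many clock cycles, and the function names `toggle_trace` and `scope_reading` are illustrative.

```python
def toggle_trace(pulses, v_high=0.5, start_high=False):
    """Voltage (mV) after each clock cycle for a toggling SFQ-to-DC converter.

    pulses     -- sequence of 0/1 values: 1 means an SFQ pulse arrived that cycle
    start_high -- initial voltage state of the converter
    """
    state = start_high
    trace = []
    for pulse in pulses:
        if pulse:
            state = not state  # every SFQ pulse flips the voltage state
        trace.append(v_high if state else 0.0)
    return trace


def scope_reading(pulses, v_high=0.5, start_high=False):
    """Average voltage (mV), as a low-speed oscilloscope would show it."""
    trace = toggle_trace(pulses, v_high, start_high)
    return sum(trace) / len(trace)


print(scope_reading([1] * 1000))                   # steady "1": midpoint level
print(scope_reading([0] * 1000))                   # steady "0", low state
print(scope_reading([0] * 1000, start_high=True))  # steady "0", high state
```

A steady stream of pulses averages to the midpoint, while a steady "0" leaves the converter in whichever state it last held, which is why "0" shows up as a double line when the display overlays both cases.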

Fig. 2.1.14. High-speed test of the ALU at 20 GHz. Constant operands and variable instructions.

A typical result of the ALU high-speed test is shown in Fig. 2.1.14. Here, we applied constant operands (A=101, B=45) and varied the 4-bit instruction code. A double line in the picture indicates output “0” and a single line represents output “1”. The 4-bit instruction code (Table 1) was controlled by low-frequency pattern generators in order to apply different instructions to the operands A and B. The correct output is shown for the addition, logical AND, and logical XOR operations.

The  most  delay-sensitive  operation  is  an  addition,  when  the  carry  signal  is  being  generated  and 
propagated  along  the  ALU.  In  Fig.  2.1.15,  we  have  fixed  operand  A  at  value  127  (to  avoid 
modulo  256  arithmetic)  and  alternated  B  between  0  and  1.  As  Fig.  2.1.15  shows,  the  ALU  output 
correctly  switches  between  values  127  and  128  in  accordance  with  changes  of  operand  B. 

Fig. 2.1.15. A 20-GHz test for the correct carry propagation in the ALU. Addition of 127 and 1. (Panel: ALU operation A+B; A = 127, B alternates between 0 and 1; traces Out0 to Out7; the output alternates between 127 and 128.)

The extreme case of addition, “255+1”, is shown in Fig. 2.1.16. Naturally, in 8-bit logic, 256 modulo 256 is 0. The result shows the correct propagation of the carry through the whole ALU, although it does not look spectacular. This is the most critical (delay-wise) case among all ALU operations. We measured +/- 5% operating margins on dc bias current. The ALU also operated properly at higher frequencies: it was tested at 23 GHz and, for some operations, up to 33 GHz. Above this frequency, most likely, the HF characteristics of the cryoprobe used in the experiment did not allow us to apply an undistorted signal.

In  order  to  test  bit-error  rate,  all  eight  outputs  of  the  ALU  were  supplied  with  complementary 
outputs  (inverted  data).  So,  for  any  operation,  we  can  pick  between  direct  and  inverted  output  for 
providing  the  set  of  eight  zeros.  In  this  case,  all  eight  toggling  SFQ/DC  converters  generate  no 
voltage.  Switching  voltage  state  of  at  least  one  of  the  8  outputs  indicates  an  error  occurred  during 
ALU  operation. 

We have observed error-free operation at 20 GHz for 17 minutes. That gives us an error rate of ~5·10^-14.
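The quoted error rate follows from counting clock cycles during the error-free interval, and the choice between direct and complementary outputs can be expressed as a per-bit selection. The sketch below assumes one ALU operation per clock cycle and takes the reciprocal of the cycle count as the estimate; the function names are illustrative.

```python
def error_rate_estimate(clock_hz, error_free_seconds):
    """Reciprocal of the number of error-free clock cycles observed."""
    cycles = clock_hz * error_free_seconds
    return 1.0 / cycles


def pick_monitored_outputs(expected_result, width=8):
    """For each bit (LSB first), pick the output that should stay at zero.

    A bit expected to be 1 is monitored on its complementary (inverted)
    output, so all monitored toggling converters see no pulses.
    """
    return ["inverted" if (expected_result >> i) & 1 else "direct"
            for i in range(width)]


print(error_rate_estimate(20e9, 17 * 60))
print(pick_monitored_outputs(0b00000101))
```

With 20 GHz and 17 minutes (1020 s), about 2·10^13 cycles pass without an error, which gives the order of magnitude stated above.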

Fig. 2.1.16. A 20-GHz test for the correct carry propagation in the ALU. Addition of 255 and 1. (Panel: ALU operation A+B; A = 255, B alternates between 0 and 1; traces Out0 to Out7; the output levels are labeled 255 and 256.)

3. SFQ 8-byte Register File design and demonstration

One of the major parts of the processor datapath is a Register File (RF). It comprises a matrix of two banks of four 8-bit registers integrated with a control logic block (Fig. 1.3.1).


Fig.  1.3.1.  Register  File  cell-level  schematics. 

We have designed, fabricated, and are currently testing the complete Register File integrated on a 1-cm² chip (Fig. 1.3.2). The chip is designed for the standard HYPRES fabrication process. The Register File has been designed in traditional RSFQ technology.

Fig. 1.3.2. The 8-word Register File integrated on a 1-cm² chip.


While  testing  of  the  whole  Register  File  is  in  progress,  we  have  successfully  tested  all  critical 
parts  of  the  circuits.  These  parts  (shown  by  colored  frames  in  Fig.  1.3.1)  were  placed  on  separate 
chips  and  tested  for  functionality.  We  have  performed  a  comprehensive  functionality  test  using 
automated  test  setup. 

A successful test of a 3x3 matrix of dual-port memory (blue frame in Fig. 1.3.1) has been reported before. Using this computerized test setup, we were able to apply the full test pattern to every cell in the matrix, thus measuring the operational margins of each of the 9 cells in the fragment. All tested cells worked within +/- 15% margins on dc bias current (Fig. 1.3.4).

The top row of cells (fragment marked with a yellow frame in Fig. 2.1) has the most complicated test logic (Fig. 1.3.5). These memory cells have two ports. Besides the parallel connection to the datapath, they are connected in series, providing external data load/unload. We have successfully tested this fragment with operational dc bias current margins of +/- 18%.


Fig. 1.3.5. Functionality test pattern of a fragment of the external port of the register file (yellow-marked fragment in Fig. 2.1). (Panels: serial interface test; parallel interface test. Traces: Parallel Write 3, Left Port 3 Out, Right Port 3 Out, Parallel Write 2, Left Port 2 Out, Right Port 2 Out, Parallel Write 1, Left Port 1 Out, Right Port 1 Out, Read Left Ports, Read Right Ports, Left port reset, Right port reset, Left port serial IN, Left port serial clock, Left port serial OUT, Right port serial IN, Right port serial clock, Right port serial OUT.)

The combination of a 3x2 memory matrix integrated with 3 port cells (red-colored fragment in Fig. 2.1) has also been successfully tested, with operational dc bias current margins of +/- 5%.

The  complicated  test  pattern  for  testing  the  whole  Register  File  requires  two  automated  test 
setups  working  in  parallel.  We  are  hoping  to  get  results  soon. 

Testing of RSFQ 8-word Register File

The requirements imposed on the register file architecture by the ALU were an 8-bit word size, 20-GHz throughput, minimal (less than 200 ps) latency, independent read/write access, and a reliable interface to the external memory (for testing). The cell-level design was done by the Stony Brook University team using their VHDL-based design flow with HYPRES cell library timing data.

The register file can perform two simultaneous read operations and one write operation from the ALU, provided that separate source registers are used for each operation and that the two reads are fetched from distinct register banks (left vs. right). We have designed a register file based on SFQ latches, such as the NDRO cell and several flavors of the D-flip-flop cell. The designed register file (Fig. 1.3.6) consists of 4 pairs of 8-bit registers that are bit-by-bit interleaved. The top row of the register file in Fig. 1.3.6 consists of dual-port memory cells (D2FF) utilizing one port as a parallel interface to the ALU and the second port as a serial interface to the external memory.

The  other  3  rows  comprise  NDRO  memory  cells  with  parallel  interface  to  the  ALU. 

In the middle of the register file, there is a Decode and Control block. It is used for buffering destination register addresses and providing the control signals that route write data from the ALU to the target destination register. Its FIFO can buffer up to four register write addresses. The control section logic has to be a single block, as it contains the critical path of the register file. Hence, the register file operating speed depends on the layout of this block.




Fig.  1.3.6.  Block  diagram  of  the  register  file. 


HYPRES Final Report W911NF-09-C-0036

To simplify the register file design, only one register of each left/right pair occupying the same row of the two banks can be written at any moment. In other words, it is not possible to have two ongoing operations, one writing to a register in the left bank and the other writing to a register in the right bank. However, simultaneous reading from both banks, or reading from one bank and writing to the other bank, is possible.
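The access rules described above can be illustrated with a small checker. This is only a sketch of one reading of those rules: the function name `check_cycle`, the tuple encoding of an operation, and the exact rule set (at most one write in flight, at most one read per bank, no register shared between operations) are assumptions made for illustration, not part of the design.

```python
from collections import Counter


def check_cycle(ops):
    """Check one set of concurrent register-file operations.

    ops is a list of (kind, bank, row) tuples, with kind in {"read", "write"},
    bank in {"left", "right"} and row in 0..3 (hypothetical encoding).
    Returns a list of rule violations; an empty list means the set is allowed.
    """
    problems = []
    reads = [op for op in ops if op[0] == "read"]
    writes = [op for op in ops if op[0] == "write"]

    # Only one write may be in progress at any moment.
    if len(writes) > 1:
        problems.append("more than one write in flight")

    # The two reads must come from distinct banks.
    reads_per_bank = Counter(bank for _, bank, _ in reads)
    for bank, count in reads_per_bank.items():
        if count > 1:
            problems.append(f"two reads from the {bank} bank")

    # Each operation must use its own register.
    uses = Counter((bank, row) for _, bank, row in ops)
    for register, count in uses.items():
        if count > 1:
            problems.append(f"register {register} used by more than one operation")

    return problems


# Two reads from different banks plus one write to a third register: allowed.
print(check_cycle([("read", "left", 0), ("read", "right", 1), ("write", "left", 2)]))
# Writes to both banks at once: rejected.
print(check_cycle([("write", "left", 0), ("write", "right", 0)]))
```

The checker only models which combinations are permitted; it says nothing about timing or about the Decode and Control block that enforces the ordering in hardware.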

The 8x8-bit register file has been designed and fabricated using the standard 1-µm 4.5-kA/cm² HYPRES process. The register file measures 4.4 mm x 3.0 mm and consists of ~4000 Josephson junctions. The total length of the passive transmission lines (PTL) used in wiring is ~0.2 m.

We have performed a complete functionality test of the register file without the Control block. The control block was successfully tested separately. Because of the complicated test patterns, the automated test setup was used for thorough testing of the register file.

Fig. 1.3.7 shows the test pattern for testing the serial interface of the right-bank port register.

[Fig. 1.3.7 signal traces recovered from the figure: Out R0; Out R1; Out R2; Out R3; Out R6; Out R7; Data]

Fig. 1.3.7. Low-speed functionality test of the serial interface to the right-bank port register (top row in Fig. 1.3.6).

For testing the parallel interface to the ALU, we have tested all registers cell by cell, accessing them individually. Fig. 1.3.8 shows a typical pattern for testing a memory register cell in row (register) M (0-3) and column (bit) N (0-7). Both (left and right) register banks are tested in this pattern. First, “1” is written to the left bank, then a readout signal verifies the content of both NDROs (only the left bank reads out “1”); then “1” is written to the right bank (both banks output “1”); and finally, “0” is written to both banks (reset). This is a complete functionality test pattern for the registers.

[Fig. 1.3.8 signal traces, top to bottom: Out Left N; Read Left M; Out Right N; Read Right M; Reset Right M; Reset Left M; Write N; Data N. Test phases, left to right: Write 1 to left bank; Write 1 to right bank; Reset both (Write 0)]

Fig. 1.3.8. Low-speed functionality test of the memory register cell located in row M and column N (Fig. 1.3.6). Some of the input signals are not shown for clarity.
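The three-phase pattern of Fig. 1.3.8 can be written down as a short software model. The sketch below is only an illustration: the `RegisterCellPair` class and its method names are hypothetical, each bank is reduced to a single non-destructively read bit, and the final readout after the reset is an added assumption (the text only states that “0” is written to both banks).

```python
class RegisterCellPair:
    """Hypothetical model of one bit (row M, column N) in the left and right banks."""

    def __init__(self):
        self.bits = {"left": 0, "right": 0}

    def write(self, bank, value):
        # Writing "0" stands in for the reset line of that bank.
        self.bits[bank] = value

    def read(self):
        # NDRO readout: reading does not change the stored bits.
        return (self.bits["left"], self.bits["right"])


def run_cell_test(cell):
    """Apply the three test phases and return the readout after each phase."""
    readouts = []
    cell.write("left", 1)          # phase 1: write "1" to the left bank
    readouts.append(cell.read())
    cell.write("right", 1)         # phase 2: write "1" to the right bank
    readouts.append(cell.read())
    cell.write("left", 0)          # phase 3: reset both banks (write "0")
    cell.write("right", 0)
    readouts.append(cell.read())   # assumed extra readout confirming the reset
    return readouts


# Expected (left, right) readouts for a working cell pair.
EXPECTED = [(1, 0), (1, 1), (0, 0)]
print(run_cell_test(RegisterCellPair()) == EXPECTED)
```

A cell that failed to store, or that was disturbed by a write to the opposite bank, would show up as a mismatch against `EXPECTED` in one of the three phases.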

With this pattern, we have tested all 24 memory register cells and 8 port register cells consecutively. Table 1.3.1 shows the test map of the register file evaluation, i.e. the operational margins (in percent) on the dc bias current for each cell.


Reg.      Bit (column number)
number    0:        1:        2:        3:        4:        5:        6:        7:
0:        -9/+24    -12/+24   -13/+24   -8/+21    -12/+21   -14/+25   -12/+24   -7/+10
1:        -11/+24   -10/+24   -13/+24   -8/+20    -13/+22   -14/+25   -12/+20   -5/+8
2:        -12/+21   -14/+22   -13/+20   -12/+20   -12/+20   -14/+19   -13/+22   -2/+3
3:        -11/+23   -11/+24   -7/+10    -5/+8     -4/+6     -3/+4     -1/+3     -1/+2

Table 1.3.1. Register File Test Map
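The test map in Table 1.3.1 can be summarised numerically. The following sketch is an illustration only: the helper names and the 5% threshold are arbitrary choices, and the entry for register 3, bit 5 is read here as -3/+4.

```python
# Margins (lower %, upper %) copied from Table 1.3.1.
# Rows are registers 0-3, columns are bits 0-7.
# The entry for register 3, bit 5 is taken as -3/+4 (assumed reading).
MARGINS = [
    [(-9, 24), (-12, 24), (-13, 24), (-8, 21), (-12, 21), (-14, 25), (-12, 24), (-7, 10)],
    [(-11, 24), (-10, 24), (-13, 24), (-8, 20), (-13, 22), (-14, 25), (-12, 20), (-5, 8)],
    [(-12, 21), (-14, 22), (-13, 20), (-12, 20), (-12, 20), (-14, 19), (-13, 22), (-2, 3)],
    [(-11, 23), (-11, 24), (-7, 10), (-5, 8), (-4, 6), (-3, 4), (-1, 3), (-1, 2)],
]


def symmetric_margin(lower, upper):
    """Largest +/- deviation that stays inside the measured window."""
    return min(-lower, upper)


def common_window(margins):
    """Bias window (in % of nominal) in which every cell still operates."""
    cells = [cell for row in margins for cell in row]
    return max(lower for lower, _ in cells), min(upper for _, upper in cells)


def weak_cells(margins, threshold):
    """(register, bit) positions whose symmetric margin is below the threshold."""
    return [
        (reg, bit)
        for reg, row in enumerate(margins)
        for bit, (lower, upper) in enumerate(row)
        if symmetric_margin(lower, upper) < threshold
    ]


print("common window:", common_window(MARGINS))
print("cells below 5%:", weak_cells(MARGINS, 5))
```

With the values as transcribed, the cells flagged by the 5% threshold are all in the last column of register 2 and the right half of register 3, which is the corner of the array where the margins collapse.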


The variation of margins from cell to cell can have many causes, such as local magnetic flux trapping, an external magnetic field, fabrication, etc.

The severe degradation of operational margins at the bottom-right part of the array has been caused by the proximity of the dc bias current injection point.

II. Energy efficiency improvement of SFQ Circuits

1. ERSFQ - Zero Static Power Dissipation RSFQ logic

Replacing each bias resistor with a Josephson junction as a current-distributing element seems to be a natural idea¹. The Josephson junction's critical current $I_c$ is a natural current-limiting phenomenon. When a shunted ($\beta_c \le 1$) Josephson junction is connected to a very small ($V \ll I_c R_n$)

[Fig. 2.1.1 labels: dc voltage source (“feeding JTL”), $V = \Phi_0 f$; dc current]

Fig. 2.1.1. The biasing scheme of ERSFQ.

The necessary condition for such a current distribution scheme is that the voltage on the power line should be equal to or greater than the maximum possible dc voltage in the powered circuit. For all RSFQ circuits (with the exception of output amplifiers and some exotic SFQ pulse multipliers²), the maximum possible dc voltage is $V_{max} = \Phi_0 f_{clock}$. In order to create such a voltage source, we use a Josephson transmission line (JTL) biased from the power line common with the circuit through large superconductive inductors. We call it a “feeding JTL”, for its function is to serve as an additional supply of bias current. By applying SFQ pulses from the circuit's clock source to the feeding JTL, we create an exact dc voltage $V_{max}$ on the bias line.

1 L. Eaton and M. Johnson, “Superconducting constant current source”, US Patent #7,002,366, issued Feb. 21, 2006.

2 V. K. Semenov, “Digital to analog conversion based on processing of the SFQ pulses”, IEEE Trans. Appl. Supercond., vol. 3, pp. 2637-2640, Mar. 1993.

To prevent dynamic current redistribution and to increase the impedance of the local bias current source, large inductances $L_b$ were serially connected to the bias junctions, providing filtering of the ac components. The maximum dynamic deviation of the bias current in this case is $\delta I \le \Phi_0 / L_b$. At $L_b$ = 400 pH, the current fluctuations do not exceed about 5 µA.

The circuit needs to be biased with a dc current just under the total critical current of the bias junctions. So, in the passive state (when the clock is not applied), an ERSFQ circuit does not dissipate any power at all (zero static power dissipation). After turning it on, i.e. applying a clock from the clock source, the total power dissipation of an ERSFQ circuit becomes $P = I_b \Phi_0 f_{clk}$, where $I_b$ is the total bias current for the circuit (from the dc current source) and $f_{clk}$ is its operating clock frequency. This is orders of magnitude less than the power dissipated in traditional RSFQ circuits.
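The three relations quoted in this section (bias-line voltage, maximum bias current deviation, and dynamic power) are easy to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, and the 0.1 A total bias current in the last example is an arbitrary assumed value, not a figure from the report.

```python
PHI_0 = 2.067833848e-15  # magnetic flux quantum in webers (V*s)


def bias_line_voltage(f_clk_hz):
    """dc voltage set on the bias line by the feeding JTL: V = Phi_0 * f_clk."""
    return PHI_0 * f_clk_hz


def max_bias_current_deviation(l_b_henry):
    """Upper bound on the dynamic bias current deviation: Phi_0 / L_b."""
    return PHI_0 / l_b_henry


def ersfq_power(i_b_amp, f_clk_hz):
    """Dynamic power of an ERSFQ circuit: P = I_b * Phi_0 * f_clk."""
    return i_b_amp * PHI_0 * f_clk_hz


v_bias = bias_line_voltage(20e9)                # 20-GHz clock
delta_i = max_bias_current_deviation(400e-12)   # 400-pH bias inductor
p_example = ersfq_power(0.1, 20e9)              # 0.1 A is an assumed bias current

print(f"V_max   = {v_bias * 1e6:.2f} uV")
print(f"delta_I = {delta_i * 1e6:.2f} uA")
print(f"P       = {p_example * 1e6:.2f} uW")
```

For a 400-pH inductor the bound evaluates to slightly more than 5 µA, in line with the figure quoted above, and a 20-GHz clock corresponds to a bias-line voltage of roughly 41 µV.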

Although additional power dissipates in the voltage source (the feeding JTL), its value varies between zero and approximately a quarter of the total power dissipation in the ERSFQ circuit. This value depends on the particular design of the circuit. In an RSFQ circuit, the clock is transmitted via a tree of clock JTLs. These JTLs are as good a voltage source for the power bus as the feeding JTL. By our estimate, in an ERSFQ circuit, the total critical current of all clock JTLs, including the feeding JTL, should be in excess of ~25% of the total dc bias current. Some ERSFQ circuits may not require a feeding JTL at all, maintaining the dc bias voltage with their clock-distributing JTLs.

[Fig. 2.1.2 labels: JTL stage; Bias bus; Bias JJ; RSFQ gate (TFF)]

Fig. 2.1.2. A layout fragment of an ERSFQ circuit.

One can easily see the large (~400 pH) bias inductors consuming substantial space on a chip.

[Fig. 2.1.3 labels: DFFC microphotograph; Lb; Jb]

Fig. 2.1.3. Microphotograph of an ERSFQ D flip-flop with complementary outputs.

The other drawback of ERSFQ is its expectedly high time jitter due to the unavoidable bias current fluctuation $\delta I \le \Phi_0 / L_b$. The obvious solution is increasing the value $L_b$ of the bias inductor and generally employing a pipeline architecture in designing large circuits. Fig. 2.1.3 shows a microphotograph of the fabricated ERSFQ circuit. In order to obtain a large inductance, ground planes were removed from under the inductor.

We have designed and fabricated several chips in order to benchmark ERSFQ technology with logic cells designed in both (ERSFQ and RSFQ) standards. This was done to experimentally verify any differences in operational margins of the two technologies. The output amplifiers have a separate power bus and are designed in standard RSFQ for the obvious reason.

The chip contains two (ERSFQ and RSFQ) versions of a D flip-flop with complementary outputs and two versions of a static frequency divider by 16. For the sake of comparison, both versions of the cells were made using the same design template and look alike (the RSFQ version of each cell was made by replacing the bias JJs with resistors in its ERSFQ counterpart).

[Fig. 2.1.4 panel title: ERSFQ DFFC LF Operation. Signal traces, top to bottom: Clock; Inverted Output; Direct Output; Input. Annotation: operating at LF within 22% dc bias margins]

Fig. 2.1.4. ERSFQ DFFC gate low-speed test.

The functionality test results of a DFFC gate are shown in Fig. 2.1.4. The circuit operated within ±22% bias current margins. The operating region includes the case when the total bias current exceeds the total critical current of the bias junctions, in which case there is static power dissipation.

[Fig. 2.1.5 panel title: ERSFQ Static Frequency Divider by 16. Signal traces: Input; Output]

Fig. 2.1.5. ERSFQ static frequency divider by 16 (operating within 26% dc bias margins).


The correct operation of a static frequency divider is shown in Fig. 2.1.5. The ERSFQ version of the circuit operated within ±26% bias current margins. For some reason, the margins [...]

[Fig. 2.1.6 panel title: ERSFQ Static Frequency Divider by 2^20. Annotation: a good circuit for verifying the ERSFQ approach, as each TFF works at its own frequency]

Fig. 2.1.6. ERSFQ static frequency divider by $2^{20}$.

We performed direct measurements of the bit-error rate (BER). The block diagram of this experiment is shown in Fig. 2.1.7. In this experiment, we used two phase-locked frequency generators: one for the high-frequency clock and the other for the reference signal. The maximum frequency we can apply to the chip through our standard cryoprobe is about 30 GHz. We used an on-chip double-rate converter to double the clock frequency. So, the first stage of the frequency divider could operate at 60 GHz. Then, after division by a factor of $2^{20}$, the signal goes through the output amplifier to an oscilloscope, where it is compared with the reference signal.
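The frequency bookkeeping of this experiment can be sketched in a few lines. The helper names below are hypothetical, and the stage count of 20 is an assumption taken from reading the figure title as a divider by 2^20.

```python
def first_stage_frequency(f_external_hz, doubler=True):
    """Clock rate at the first TFF; the on-chip converter doubles the external rate."""
    return 2 * f_external_hz if doubler else f_external_hz


def stage_input_frequencies(f_first_hz, n_stages):
    """Input frequency of each TFF stage: stage k runs at f_first / 2**k."""
    return [f_first_hz / 2 ** k for k in range(n_stages)]


def divider_output_frequency(f_first_hz, n_stages):
    """Output frequency after n_stages divide-by-two stages."""
    return f_first_hz / 2 ** n_stages


N_STAGES = 20  # assumption: divider by 2**20, as read from the figure title

f_first = first_stage_frequency(30e9)   # 30 GHz applied through the cryoprobe
f_out = divider_output_frequency(f_first, N_STAGES)

print(f"first stage: {f_first / 1e9:.0f} GHz")
print(f"output:      {f_out / 1e3:.2f} kHz")
```

Under these assumptions the output falls in the tens of kilohertz, low enough to be compared with a reference signal on an ordinary oscilloscope, while every TFF in the chain runs at its own, successively halved, frequency.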


The circuit worked correctly at up to 67 GHz of clock frequency within ±16% dc bias current margins. This shows that it could have worked at a much higher frequency, and 33 GHz is just a limit of our high-[...]

Fig. 2.1.7. Experiment on high-speed BER measurement of the ERSFQ static frequency divider by $2^{20}$.

2. An ERSFQ 8-bit Adder Family

Using the ERSFQ approach, we have designed two versions of an 8-bit parallel adder employing a wave-pipelined architecture. The adders were constructed with an asynchronous-carry half-adder (HA) cell. This cell is a hybrid between an asynchronous Muller element and a B flip-flop. It is insensitive to the delay between inputs, allowing high-throughput operation of the adder.

The first design (Fig. 2.2.1) was the most straightforward approach, imposed by the aligned data front requirement (i.e. LSB and MSB should be produced simultaneously). The second reason was our desire (at the time) to avoid using a merger cell in the ERSFQ design. Further on, we will refer to this architecture as an “aligned-front” (AF) adder. In this architecture, the clock signal follows the data, resulting in single-clock operation. For the half-adder cell described above, the clock is needed for producing the SUM (XOR) output, while the CARRY (AND) output signal is generated and propagated asynchronously.
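The behaviour of the half-adder cell described above (SUM read out by the clock, CARRY produced asynchronously) can be captured in a small event-driven model. This is a behavioural sketch only: the class `AsyncCarryHalfAdder`, its pulse-counting state, and the helper `evaluate` are hypothetical and do not reflect the junction-level circuit.

```python
class AsyncCarryHalfAdder:
    """Hypothetical behavioural model of the asynchronous-carry half-adder cell."""

    def __init__(self):
        self.count = 0  # number of "1" pulses received since the last clock

    def pulse(self):
        """Apply a "1" pulse to either input; return True if CARRY is emitted."""
        self.count += 1
        # CARRY (AND) fires as soon as the second "1" arrives, without waiting for the clock.
        return self.count == 2

    def clock(self):
        """Clock pulse: read out SUM (XOR) and reset the cell."""
        total = self.count
        self.count = 0
        return total % 2


def evaluate(a, b):
    """Run one clock period with input bits a and b; return (sum, carry)."""
    cell = AsyncCarryHalfAdder()
    carry = False
    for bit in (a, b):
        if bit:
            carry = cell.pulse() or carry
    return cell.clock(), int(carry)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, evaluate(a, b))
```

In this model the order in which the two input pulses arrive does not matter, which mirrors the statement that the cell is insensitive to the delay between inputs.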

The asynchronous carry propagation and a specially designed clock distribution tree provided, in simulation, a high throughput (up to 30 GHz) with 800-ps latency (for the 4.5-kA/cm² process). The adder was designed for a 20-GHz target throughput. It consists of 36 half-adder cells (~2,000 JJs) and dissipates 0.36 fJ per single 8-bit addition operation.

[Fig. 2.2.1 panel title: Ripple-Carry 8-bit Adder]

Fig. 2.2.1. Schematics of the “aligned-front” 8-bit wave-pipeline adder.


The obvious drawback of this design is its large size. An N-bit AF adder requires $N(N+1)/2$ half-adder cells. This results in a low-scale integration (LSI) and in a high total dc bias current (i.e. power dissipation in ERSFQ).

After solving a problem with an ERSFQ merger cell, we have designed a simple ripple-carry adder (Fig. 2.1.2). In this case, the N-bit adder utilizes only $2N$ half-adder cells. It still operates in the wave-pipeline mode (i.e. performs a full operation in a single clock), but requires a heavily skewed input data front for operating at maximum speed.
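The two cell-count formulas can be compared directly. The sketch below simply evaluates N(N+1)/2 for the aligned-front adder and 2N for the ripple-carry adder; the function names are hypothetical.

```python
def af_half_adders(n_bits):
    """Half-adder cells in an N-bit aligned-front adder: N(N+1)/2."""
    return n_bits * (n_bits + 1) // 2


def rc_half_adders(n_bits):
    """Half-adder cells in an N-bit ripple-carry adder: 2N."""
    return 2 * n_bits


for n in (4, 8, 16, 32):
    print(f"N={n:2d}  AF={af_half_adders(n):4d}  RC={rc_half_adders(n):3d}")
```

For N = 8 the formulas give 36 and 16 cells respectively, and the gap widens quickly with word size because the aligned-front count grows quadratically while the ripple-carry count grows linearly.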


[Fig. 2.1.2 label: Clock]

Fig. 2.1.2. Schematics of the ripple-carry 8-bit wave-pipeline adder.

For the 4.5-kA/cm² fabrication process, the 8-bit RC adder operates at 37-GHz throughput. For the same $J_c$, the latency between LSB-in and LSB-out is 44 ps, and the total latency (between LSB-in and MSB-out) is 960 ps. The RC adder dissipates only 0.16 fJ per single 8-bit addition operation.

Initially, the 8-bit AF adder was designed and fabricated in HYPRES's old 4-layer 1.0-µm 4.5-kA/cm² process. Then, with the successful development of HYPRES's advanced process, we have redesigned the chip with the new (6 metal layers, 0.25-µm lithography) design rules.

The advanced process allowed scaling down our cell library. This process is especially beneficial for ERSFQ designs. Utilizing two extra planarized wiring layers, we placed all space-consuming bias inductors under the ground plane. This, together with the lithography improvement, results in a sevenfold area reduction of an ERSFQ design.

Fig. 2.1.3 shows a microphotograph of the 8-bit ERSFQ adder placed on a 5x5-mm chip fabricated in the standard (a) and advanced (b) HYPRES process. The difference in the circuit area is a factor of 6.6.

[Fig. 2.1.3 annotations. Standard HYPRES process: 1.0/4.5 kA/cm²; 4 metal layers; 1.0 µm feature. Enhanced HYPRES process: 4.5/20.0 kA/cm²; 6 metal layers; 0.25 µm feature. 6.6 times reduction of circuit area]

Fig. 2.1.3. Microphotograph of the 8-bit aligned-front adder on a 5x5 mm chip fabricated in the standard (a) and advanced (b) HYPRES process.


After that, we have designed a similar adder for HYPRES's 0.25-µm 4-layer process (Fig. 2.1.4a) in order to compare the test results. Thus, the ERSFQ 8-bit adder family has become our benchmark for evaluating a fabrication process.

Fig. 2.1.4. The 8-bit RC adder on a 5x5-mm chip designed for HYPRES's 0.25-µm 4-layer process.

The test methodology and test patterns are the same for all flavors of parallel 8-bit adders. Because of that, we will first describe the test procedure without specifying the type of adder. Then, we will report and discuss the test results specific to each fabrication process.

The functionality test at low speed allows a comparatively easy evaluation of a chip for high-speed test pre-selection. It also provides valuable feedback to the foundry on the yield.

Using an automated test setup, we have performed the complete functionality test of the adder at low speed. We have measured the dc bias current margins for all $2^{16}$ combinations of arguments A and B of the 8-bit adder. This comprehensive test takes a long time and has to be run overnight. Later, we decided that testing only the critical carry patterns would be sufficient for the functionality test.

Fig. 2.1.5. Functionality test of the 8-bit ERSFQ adder.
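The exhaustive low-speed test over all operand pairs has a direct software analogue. The sketch below is not the hardware test procedure: it is a hypothetical reference model of a ripple-carry adder built from two half adders per bit, checked against ordinary integer addition for every pair of 8-bit operands. The short list of critical carry patterns is an assumed example of operands that force a long carry chain.

```python
def half_adder(a, b):
    """Return (sum, carry) for two input bits."""
    return a ^ b, a & b


def ripple_carry_add(a, b, n_bits=8):
    """Add two n-bit numbers using two half adders per bit position."""
    carry = 0
    result = 0
    for i in range(n_bits):
        s1, c1 = half_adder((a >> i) & 1, (b >> i) & 1)
        s2, c2 = half_adder(s1, carry)
        result |= s2 << i
        # c1 and c2 are never both 1, so merging them with OR is sufficient.
        carry = c1 | c2
    return result | (carry << n_bits)


def exhaustive_test(n_bits=8):
    """Compare the model with integer addition for every operand pair."""
    size = 2 ** n_bits
    failures = [
        (a, b)
        for a in range(size)
        for b in range(size)
        if ripple_carry_add(a, b, n_bits) != a + b
    ]
    return size * size, failures


# Assumed examples of operand pairs with a long carry chain.
CRITICAL_CARRY_PATTERNS = [(255, 1), (127, 1)]

total, failures = exhaustive_test()
print(f"{total} combinations tested, {len(failures)} failures")
for a, b in CRITICAL_CARRY_PATTERNS:
    print(f"{a}+{b} = {ripple_carry_add(a, b)}")
```

In software the full sweep finishes almost instantly; on the chip each combination also requires a bias margin measurement, which is why the complete test has to run overnight.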


An illustrative test pattern is shown in Fig. 2.1.5. This pattern has been created to visualize the adder functionality.

In order to perform the high-speed test, we used a built-in switch-based test bed. This approach also allows us to measure the bit-error rate of the circuit at high speed. The idea is to produce input data signals by passing clock pulses through SFQ relays controlled by an external dc current. If the control current is ON, the clock pulse passes through the relay, providing “1” on the corresponding input, and “0” otherwise. Thus we can apply the same input pattern repeatedly at a high-speed clock rate. The output data stream is monitored on toggling-type SFQ-to-DC converters. When the output is “0” (no pulses coming to the converter), the converter output is a static voltage level (0 or Vmax). If a high-speed stream of SFQ pulses is coming to the converter, it oscillates at the clock frequency, resulting in an average voltage level between the two voltage states (~Vmax/2). On the oscilloscope, the first (“0”) case is represented by a double line, and the second (“1”) by a single “fat” line. Not every circuit can be tested with this method, but the parallel adder is quite suitable for it.

Fig. 2.1.6. A 20-GHz test of the 8-bit ERSFQ adder. Argument A is 127, while B is alternating 0 and 1.
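The readout principle of the switch-based test bed can be mimicked with a toy model. Everything in the sketch below is hypothetical: the function names, the representation of pulses as 0/1 lists, and the normalised Vmax of 1.0 are chosen only to show why a stream of “1”s averages to about half of Vmax while a stream of “0”s leaves a static level.

```python
def relay_output(clock_pulses, control_on):
    """SFQ relay: clock pulses pass only while the control current is ON."""
    return list(clock_pulses) if control_on else [0] * len(clock_pulses)


def sfq_to_dc_trace(pulses, v_max=1.0, state=0):
    """Toggling converter: every incoming pulse flips the output between 0 and v_max."""
    trace = []
    for pulse in pulses:
        if pulse:
            state ^= 1
        trace.append(state * v_max)
    return trace


def average(trace):
    return sum(trace) / len(trace)


clock = [1] * 1000                     # one pulse per clock period

ones = sfq_to_dc_trace(relay_output(clock, control_on=True))
zeros_low = sfq_to_dc_trace(relay_output(clock, control_on=False), state=0)
zeros_high = sfq_to_dc_trace(relay_output(clock, control_on=False), state=1)

print("stream of 1s, average level:", average(ones))
print("stream of 0s, levels seen:  ", set(zeros_low) | set(zeros_high))
```

A slow oscilloscope sees the fast toggling as a single line near Vmax/2, whereas the idle converter sits at whichever of the two static levels it was left in, which gives the double line described above.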


Using this technique, we have performed various tests. The most timing-critical operation of a parallel adder is CARRY propagation through the whole circuit. In the case of an 8-bit parallel adder, this is the operation “255+1” or “127+1”. Fig. 2.1.6 shows correct operation at 20 GHz for the alternating operations “127+0” and “127+1”.

A more visually interesting pattern is the one with all possible 8-bit numbers. The test results for such a pattern require a complicated interpretation. Fig. 2.1.7 illustrates how to apply any 8-bit number pattern simply by using 8 synchronized generators. Here, one of the adder arguments is equal to 0, while the other scans through all possible 8-bit numbers. So, the operation shown in Fig. 2.1.7 is A+0=A for all 256 combinations. After applying argument B as alternating “0” and “1”, we have obtained Fig. 2.1.8. The arrow points at the correct operation “127+1”.

[Fig. 2.1.7 panel title: All 256 8-bit numbers are assigned to A, B=0]

[Fig. 2.1.8 panel titles: ERSFQ 8-bit Adder 20-GHz test; Addition of A and alternating B (0/1)]

Fig. 2.1.8. A 20-GHz test of the 8-bit adder. A consecutively takes all 256 numbers, while B is alternating 0 and 1.

Using a high-frequency oscilloscope, we have directly measured the latency of the AF adder. The delay between the clock and the output signal is 900 ps. That is slightly larger than predicted by simulation (800 ps).

The structure of RSFQ logic allows a quite simple measurement of the power dissipated by an ERSFQ circuit. All RSFQ circuits are biased with the current from an external dc current source. Therefore, the dissipated power equals the product of the dc bias current and the measured average voltage on the bias line. Naturally, the measured voltage depends on the clock frequency of the circuit. In the case of ERSFQ, it is a direct Josephson relation ($V = f_{clk} \Phi_0$). Fig. 2.1.9 shows the results of such a measurement for both AF and RC adders. The experimental results are in ideal agreement with the theoretical prediction. As can be seen, the power dissipation of the AF adder at 20 GHz is ~7.2 µW. That gives us the theoretically predicted 0.36 fJ per single clock. [...] bias current (~3.2 µ[...]
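The arithmetic behind these figures can be reproduced in a few lines. The sketch below is illustrative: the function names are hypothetical, and the total bias current it derives is an inferred quantity, not a value stated in the report.

```python
PHI_0 = 2.067833848e-15  # magnetic flux quantum in webers (V*s)


def energy_per_operation(power_w, f_clk_hz):
    """Energy dissipated per clock period: E = P / f_clk."""
    return power_w / f_clk_hz


def power_from_energy(energy_j, f_clk_hz):
    """Power at a given clock rate for a fixed energy per operation."""
    return energy_j * f_clk_hz


def total_bias_current(power_w, f_clk_hz):
    """Bias current implied by P = I_b * Phi_0 * f_clk (derived, not measured here)."""
    return power_w / (PHI_0 * f_clk_hz)


F_CLK = 20e9

e_af = energy_per_operation(7.2e-6, F_CLK)   # AF adder: 7.2 uW at 20 GHz
p_rc = power_from_energy(0.16e-15, F_CLK)    # RC adder: 0.16 fJ per operation
i_af = total_bias_current(7.2e-6, F_CLK)

print(f"AF adder energy per operation: {e_af * 1e15:.2f} fJ")
print(f"RC adder power at 20 GHz:      {p_rc * 1e6:.1f} uW")
print(f"AF adder implied bias current: {i_af * 1e3:.0f} mA")
```

The 7.2 µW measured at 20 GHz corresponds to 0.36 fJ per operation, and the 0.16 fJ quoted for the RC adder corresponds to about 3.2 µW at the same clock rate. Since the energy per operation equals the bias current times the flux quantum, the power grows linearly with clock frequency.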


[Fig. 2.1.9 plot: horizontal axis Clock Frequency (GHz), 0 to 22]

Fig. 2.1.9. Experimentally measured power dissipation of the 8-bit ERSFQ AF (solid line) and RC (dashed line) adders vs. operating frequency.

3. Redesigning Arithmetic Logic Unit (ALU) in ERSFQ

While redesigning the existing Kogge-Stone-adder-based 8-bit ALU in ERSFQ technology, we encountered a problem with the size of its physical layout. The ALU was too large to be integrated with the Register File and the Instruction Decoder on one chip. So, we had to change the ALU architecture to a more compact version without losing performance and while keeping the same specifications.

The previous ALU architecture (developed by subcontractors from Stony Brook University) was based on a Kogge-Stone type adder (see previous progress reports). The advantages of this [...]

Fig. 2.3.1. Block diagram of the 8-bit ERSFQ ALU.


The base element of the ALU is an ERSFQ Half Adder cell (Fig. 2.3.2). This cell has been successfully used in the ERSFQ 8-bit adder, in a DSP processor, and in various ADCs. The main feature of this cell is its asynchronous Carry signal: the Carry signal is not latched to the Clock signal and is produced as soon as the second argument “1” arrives before the Reset (clock). This property allows us to propagate Carry signals in the form of a wave.

Fig. 2.3.2. ERSFQ Half Adder cell. Schematics and functionality test.

The instruction select is achieved by a switch cell that relays Sum and Carry signals from the first stage of Half Adders to the second stage. This simple routine provides execution of such instructions [...] the operands, this gives [...]

[Fig. 2.3.3 labels: Port & buffer; Address; Instruction]

Fig. 2.3.3. The ERSFQ processor datapath timing scheme.


In the new ALU architecture (Fig. 2.3.1), we exploited the advantages of the local timing characteristic of RSFQ technology. The idea of the new wave-pipelined ERSFQ ALU is to propagate an instruction code and a clock signal in sync from the LSB to the MSB of the operands. The same “skewed” approach (Fig. 2.3.3) is used in reading from and writing to a register as well, providing extremely high throughput. The small “vertical” size of the ALU, i.e. the short distance between input and output terminals, provides very low latency (simulated ~80 ps). Here, by “latency” we mean the “turnaround” time between the start of loading the operand LSBs and receiving the result LSB. The wave propagation time from LSB to MSB is 400 ps for the 8-bit ALU, but it does not affect the datapath performance and therefore should not be regarded as ALU latency. The simulated throughput of the 8-bit ALU is 44 GHz in the 4.5-kA/cm² fabrication process.


Fig. 2.3.4. A 5x5-mm chip with the prototype of the 8-bit ERSFQ ALU.


Finally, we have designed an 8-bit ALU prototype on a 5 mm by 5 mm chip (Fig. 2.3.4). The size of the ERSFQ ALU is three times smaller than the size of the previous RSFQ 8-bit ALU, although ERSFQ cells occupy more room than their RSFQ counterparts because of the biasing
scheme.

4. Energy Efficiency Maximization of SFQ Circuits


Summary: 

During this project period, we developed approaches to maximize the energy efficiency of SFQ digital circuits. We performed the first experimental demonstration of eSFQ, another recently proposed energy-efficient single flux quantum logic with zero static power dissipation. We also demonstrated that the introduction of passive phase shifters allows the reduction of dynamic power dissipation by about 20%, reaching ~0.8 aJ per bit operation. Two types of demonstration eSFQ circuits, shift registers and demultiplexers (deserializers), were implemented using the standard HYPRES 4.5 kA/cm² fabrication process.

Scope 

In this project, we achieved the first experimental demonstration of eSFQ circuits, including shift registers and demultiplexers. These eSFQ circuits make use of superconducting dc bias current dividers and thus avoid static power dissipation. Until recently, this was considered impossible in RSFQ-type circuits, since it would lead to superconducting phase and average voltage imbalances caused by data SFQ propagation in superconducting Josephson circuits. In eSFQ circuits, the core RSFQ advantages of high speed, dc power, internal memory, and local clock control, along with the already developed RSFQ circuit designs, are largely preserved. The elimination of static power dissipation in eSFQ circuits reduces the overall circuit power by over two orders of magnitude compared to conventional RSFQ circuits.

We were also able to reduce the dynamic power dissipation of an SFQ gate. This is achieved by
employing  passive  superconducting  phase  shifters  resulting  in  reduction  of  circuit  bias  current 
and,  thus,  dynamic  power  dissipation.  Furthermore,  since  the  eSFQ  circuit  dc  bias  is  controlled 
by  the  SFQ  clock,  one  can  manage  dynamic  power  by  managing  the  distribution  of  SFQ  in  the 
clock  network.  It  is  possible  to  turn  off  the  SFQ  clock  for  a  particular  part  of  a  processor  and 
effectively  stop  the  circuit  operation,  i.e.  achieve  zero  dynamic  power,  while  maintaining  the 
internal  state  of  the  affected  circuit.  This  feature  is  compatible  with  the  coveted  goal  of 
achieving  energy-proportional  computing  as  the  ultimate  energy-efficient  machine. 

Design  of  eSFQ  demonstrator  circuits 

Most  RSFQ  circuits  are  generally  well-suited  to  conversion  to  eSFQ.  D-cell  (D  flip-flop) 
conversion  from  RSFQ  to  eSFQ  is  depicted  in  Fig.  2.4.1.  Conventionally,  the  D-cell  is  biased  so 
that  it  initially  stores  a  logic  “0”.  Such  biasing  occurs  on  junction  ,  which  only  experiences  a 
phase  increment  when  a  pulse  arrives  at  In.  However,  exactly  one  pulse  arrives  at  Clock 
during  every  clock  period,  ensuring  a  phase  increment  across  the  Decision  Making  Pair 
(DMP).  Hence,  for  eSFQ,  the  bias  injection  point  is  moved  to  the  DMP.  As  a  side  effect,  the 
converted  D-cell  stores  a  logic  “1”  after  initial  bias  ramp-up. 
Fig. 2.4.1. Conversion of standard RSFQ D flip-flop to eSFQ.


We chose two circuits for the first experimental demonstration of the eSFQ approach: a shift register
and a demultiplexer (deserializer). A shift register is a typical benchmark circuit used in the
assessment of a new circuit technology. It is also widely used in digital and mixed-signal
circuits.  Both  of  these  circuits  are  quite  suitable  for  implementation  using  the  eSFQ  approach,  as 
they  are  clocked  and  therefore  naturally  coupled  to  an  SFQ  clock  distribution  network. 

Circuit design and analysis of performance metrics were carried out with a pre-release version of
the NioPulse software suite, whereas LASI 7 software was employed for cell and chip layout.
Circuit  extraction  and  verification  were  done  with  the  InductEx  package. 

Two different eSFQ versions of these shift registers were implemented: a straightforward
conversion from RSFQ termed “eSR” and a version with an additional magnetic flux bias,
“MeSR”. The MeSR design is shown to have higher margins while retaining the high speed of the
conventional RSFQ design. It also has the potential to achieve lower bias current and thus
lower dynamic power dissipation.

The  eSFQ  shift  register  cell,  eSR,  is  depicted  in  Fig.  2.4.2(a).  Its  topology  can  be  partitioned 
into  two  sections:  clock  and  data.  Junctions  and  the  DMP  (  )  make  up  the  clock 

section,  which  transmits  the  clock  and  interrogates  the  DMP.  Note  that  we  used  a  counter-flow 
clocking  scheme,  as  this  has  generally  yielded  higher  margins  in  conventional  RSFQ  designs.  If 
switches,  the  clock  pulse  is  simply  transmitted  from  CIn  to  COut.  If  switches  instead,  the 
clock  pulse  is  transmitted  and  an  output  pulse  is  generated,  which  exits  at  DOut.  In  either  case, 
the  phase  increases  by  .  The  only  bias  current  injection  point  for  eSR  is  at  point  . 
Inductors  and  determine,  to  a  large  extent,  the  bias  current  distribution  between 

and  the  DMP  (somewhat  skewed  by  Josephson  inductances  and  parasitics).  Junctions  and 
make  up  the  data  section.  After  bias  current  ramp-up,  is  biased,  whereas  is  not,  which 
corresponds  to  the  cell  storing  a  “1.”  After  the  first  clock  signal,  the  bias  current  redistributes  to 
,  which  corresponds  to  the  cell  storing  a  “0”.  A  pulse  appears  at  the  output,  representing  the 
initially  stored  “1”  as  depicted  in  Fig.  2.4.2(b). 
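
The logic-level consequence of the initially stored “1” can be illustrated with a minimal behavioural sketch. It models only the stored bits, not the Josephson dynamics, the bias redistribution, or the counter-flow clock timing; the function name and the test pattern are illustrative assumptions, not part of the design flow described here.

```python
from collections import deque

def simulate_shift_register(n_cells, input_bits, initial_bit=1):
    """Behavioural model of an n-cell shift register.

    Every cell starts out holding `initial_bit` (1 for an eSR-like cell after
    bias ramp-up, 0 for an MeSR-like cell). Each clock event emits the bit held
    by the last cell and shifts one new input bit into the first cell.
    """
    cells = deque([initial_bit] * n_cells)
    output = []
    for bit in input_bits:
        output.append(cells.pop())   # last cell drives the output
        cells.appendleft(bit)        # new data enters the first cell
    return output

pattern = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical test pattern
# Append 4 padding clocks so the whole pattern is flushed through 4 cells.
out = simulate_shift_register(4, pattern + [0] * 4, initial_bit=1)
print(out[:4])   # bits stored at ramp-up
print(out[4:])   # input pattern, delayed by 4 clock events
```

With `initial_bit=0` the same model describes a register that starts empty, so the first outputs are zeros instead of ones.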




HYPRES  Final  Report  W911NF-09-C-0036  (proposal  #55336PHQC) 


Fig. 2.4.2. eSR - eSFQ shift register cell: schematic (parasitics are omitted) (a), typical configuration
illustrating the counterflow clock (b), and simulated cell operation (c). Circuit parameters for (a):
Inductances: L1: 2 pH, L2: 1.9 pH, L3: 1.2 pH, L4: 2.1 pH, L5: 2.1 pH, L6: 3.5 pH, L7: 3.4 pH, LS1:
10.8 pH, Lb: 10 pH. Critical currents: Jc1: 213 µA, Jc2: 250 µA, J1: 313 µA, J2: 188 µA, Jd: 225 µA,
Jb: 575 µA. All junctions except J1 are critically damped (  ). J1 has


Margins of operation of the circuit were determined for a 4-bit shift register configuration. The
critical parameter was identified as the critical current of junction J1, the upper (escape) junction
of  the  DMP.  One  of  the  reasons  for  this  is  the  injection  of  bias  current  through  the  DMP  as 
required  in  accordance  with  the  eSFQ  biasing  scheme.  The  difference  between  the  biased  and 
unbiased  DMP  is  evident  from  the  phases  of  the  DMP  junctions  shown  in  Fig.  2.4.3. 

The grounded junction in a DMP switches when it is biased, whereas the escape junction
switches when the grounded junction is not biased. The escape junction is, conventionally,
not biased. Hence, when the grounded junction is unbiased, as it is in RSFQ (ERSFQ) circuits, both junctions
have a phase near zero, and because the escape junction has a lower critical current and is closer to the source of the
interrogating pulse, it switches in the presence of an interrogating pulse. When the grounded junction is biased, its
phase is nearly critical while the phase of the escape junction remains near zero, making the grounded junction the switching
junction when the DMP is interrogated. When biasing through the DMP as in eSFQ, the escape
junction is permanently biased. This fixes its operating-point phase at greater than zero,
increasing its affinity to switch, particularly as it is closer to the source of the interrogating pulse
(the Clock node). This reduces the difference between the steady-state phases of the two DMP junctions when
the grounded junction is biased. In this case, the increased switching affinity of the escape junction is undesirable (as the grounded junction should then
switch) and results in lower parameter margins for the escape junction.


Fig.  2.4.3.  Comparison  of  RSFQ  (ERSFQ)  and  eSFQ  biasing:  Injecting  bias  current  in  the 
conventional  DFF  leaves  the  upper  junction  of  the  DMP  unbiased  (a).  Moving  the  bias  current 
injection  point  to  the  DMP  forces  the  phase  in  both  DMP  junctions  in  the  same  direction  during  ramp- 
up  (b).  (Colour  may  appear  only  in  the  online  journal.) 


In order to improve the margins of the escape junction J1, it was designed in an overdamped configuration with
  . An overdamped junction exhibits lower switching speeds and is thus less likely to
switch before the grounded junction when the DMP receives the interrogating pulse. With the overdamped J1, a 4-
bit configuration of eSR achieved critical margins of    and bias margins of    %, with
bias margin relating to the bias of the entire 4-bit test structure and critical parameter the area of
J1 for all four eSR cells.

Although overdamping J1 achieves the goal of increased parameter margins, it has the
undesired side-effect of increasing data-dependent clock skew. Since the escape junction switches slower than the grounded junction,
the clock propagates through the shift register faster when the stored bits are primarily “1”s, and
slower if the stored bits are primarily “0”s. This effect is seen also in conventionally biased shift
registers, although to a lesser extent. In conventionally biased shift registers, this effect is due to an
underbiased escape junction. In eSR, it is due to the slow-down imposed for eSFQ biasing.
Compared to critical damping, with the damping used above, the characteristic time of J1 is
doubled, potentially halving the maximum clock frequency achievable.
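
The link between the damping parameter and the junction speed can be made explicit with a small sketch. It assumes the standard resistively shunted junction relations, characteristic time tau = Phi0 / (2*pi*Ic*R) and Stewart-McCumber parameter beta_c = 2*pi*Ic*R^2*C / Phi0, with critical current and capacitance held fixed. The damping value used for the escape junction is not legible in this copy of the report, so the value 0.25 below is purely illustrative, chosen because it reproduces the stated factor of two.

```python
import math

def characteristic_time_ratio(beta_c, beta_c_ref=1.0):
    """Ratio tau(beta_c) / tau(beta_c_ref) at fixed Ic and junction capacitance.

    Assumes tau = Phi0 / (2*pi*Ic*R) and beta_c = 2*pi*Ic*R**2*C / Phi0,
    so the shunt resistance R scales as sqrt(beta_c) and tau as 1/sqrt(beta_c).
    """
    if beta_c <= 0 or beta_c_ref <= 0:
        raise ValueError("beta_c values must be positive")
    return math.sqrt(beta_c_ref / beta_c)

# Illustrative value only: a junction with beta_c = 0.25 is twice as slow
# as a critically damped one (beta_c = 1) under the assumptions above.
ratio = characteristic_time_ratio(0.25)
print(f"tau ratio relative to critical damping: {ratio:.2f}")
```

Under these assumptions the maximum clock frequency, which scales inversely with the characteristic time, would drop by the same ratio.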

Hence  it  was  desirable  to  have  all  junctions  equally  shunted  with  and  therefore,  having 

the same junction speed, achieving the maximum frequency for a typical 4.5 kA/cm² critical
current  density.  To  accomplish  this  without  punishingly  narrow  parameter  margins,  one  might 
investigate  several  options.  For  correct  operation  an  interrogating  pulse  must  not  cause  to 
switch  when  is  biased.  At  first  glance,  keeping  the  critical  current  of  low  should  achieve 
this. Considering Fig. 2.4.2, when the bias current enters at node  , some travels down the DMP,
biasing  junction  .  After  crossing  ,  the  bias  current  divides  again  at  node  .  The  alternate  path 
to  ground  through  means  that  some  of  the  bias  current  leaks  away  from  ,  ensuring  that 
always  receives  less  bias  current  than  .  Lowering  the  critical  current  of  increases  its 
Josephson  inductance,  which  exacerbates  the  leakage  effect. 


Fig. 2.4.4. MeSR - eSFQ shift register cell with magnetic flux bias: schematic of eSFQ shift register
cell with flux bias (parasitics are omitted) (a); a typical configuration illustrating the counterflow
clocking scheme (b); simulated operation (c). The flux bias ramps up from 0.1 ns to 0.2 ns, clearly
evident in the current trace for  . After flux-bias ramp-up, MeSR is non-storing. Circuit parameters
for (a): Inductances: L1: 2.2 pH, L2: 1 pH, L3: 1.8 pH, L4: 2.0 pH, L5: 1.8 pH, L6: 2.8 pH, L7: 4.2
pH, LS1: 11.2 pH, Lf: 11.5 pH, k: 0.25, Lb: 10 pH. Critical currents: Jc1: 188 µA, Jc2: 188 µA, J1:
288 µA, J2: 200 µA, Jd: 163 µA, Jb: 525 µA. All junctions are critically shunted.


A  magnetically  introduced  corrective  flux  bias  was  used  to  solve  the  leakage  problem,  resulting 
in  cell  MeSR,  depicted  in  Fig.  2.4.4.  The  dc  flux  bias,  introduced  through  ,  forces  the  current 
in  the  storage  loop  to  redistribute  as  intended,  opposing  the  leakage  effect.  In  this  way,  the  phase 
offset of   (as a result of the eSFQ bias current) can be modified. During circuit optimization, it
became  apparent  that  using  the  flux  bias  to  redirect  initial  bias  current  from  to  was  most 
effective  at  maximizing  parameter  margins.  A  potential  advantage  of  this  is  that  shift  registers 
based  on  MeSR  initially  store  a  “0”  which  aligns  well  with  conventional  RSFQ  shift  registers. 
A further advantage of the flux bias manifests itself in reduced bias current requirements in terms
of injected bias current (and a corresponding decrease in dynamic power dissipation). As
undesired  leakage  can  be  avoided  and  the  desired  balance  in  the  storage  loop  be  established 
using  the  flux  bias,  less  bias  current  needs  to  be  injected  at  the  injection  point  .  Initial  “0” 
storage  and  reduced  bias  current  requirements  are  shown  in  Fig.  2.4.5.  Note  that  one  flux  bias 
line  is  required  to  bias  the  entire  shift  register,  irrespective  of  its  length.  For  a  4-bit  MeSR-based 
shift  register,  a  critical  margin  of  ,  and  bias  margin  of  were  achieved. 

The presence of the additional flux bias line might appear as a complication. In reality, a
constant flux bias can be implemented in a variety of ways ranging from a small superconducting
loop with frozen-in SFQ to a π-junction implemented using superconducting-ferromagnetic-
superconducting (SFS) Josephson junctions. Even for conventional RSFQ circuits, improved
operational margins, bit-error rates, and gate memory non-volatility have been reported.

The  reduction  of  bias  current  directly  translates  into  the  reduction  of  dynamic  power  dissipation 
as  .  This  makes  the  magnetic  bias  approach  especially  valuable.  As  magnetic  flux 

bias is a passive non-switching element, it does not contribute to power dissipation. As is evident
from the results of simulations for a 20 GHz clock, the eSR shift register consumes ~1.0 aJ/bit,
while the MeSR shift register consumes ~0.8 aJ/bit. These energies correspond to the centre of
the bias current operational region. At the lower limit, the energy per bit operation reaches ~0.5
aJ/bit.
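
The quoted energies can be related to bias current and power with a short worked example. It assumes the relation commonly quoted for SFQ logic with zero static power dissipation, namely that each cell dissipates E = I_b * Phi0 per clock period, so that the dynamic power is P = I_b * Phi0 * f. The equation is not legible at this point in this copy of the report, so this relation and the "one bit operation per cell per clock period" reading are assumptions of the sketch.

```python
PHI0 = 2.067833848e-15  # magnetic flux quantum h/(2e) in webers

def implied_bias_current(energy_per_bit):
    """Bias current (A) per cell implied by the assumed relation E = I_b * PHI0."""
    return energy_per_bit / PHI0

def dynamic_power(energy_per_bit, clock_hz):
    """Dynamic power (W) of one cell performing one bit operation per clock period."""
    return energy_per_bit * clock_hz

E_ESR = 1.0e-18    # J/bit, simulated eSR value quoted in the text
E_MESR = 0.8e-18   # J/bit, simulated MeSR value quoted in the text
CLOCK = 20e9       # Hz

reduction = 1.0 - E_MESR / E_ESR
for name, energy in (("eSR", E_ESR), ("MeSR", E_MESR)):
    print(f"{name}: I_b ~ {implied_bias_current(energy) * 1e6:.0f} uA, "
          f"P ~ {dynamic_power(energy, CLOCK) * 1e9:.0f} nW per cell at 20 GHz")
print(f"relative reduction: {reduction:.0%}")
```

Because the energy per bit is proportional to the bias current under this assumption, a given relative reduction in bias current gives the same relative reduction in dynamic power at any clock frequency.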
Fig. 2.4.5. Results of simulation with a 20 GHz clock for 1.0 aJ/bit eSR- (a) and 0.8 aJ/bit MeSR-based
(b) 16-bit eSFQ shift registers at bias current corresponding to the centre of the operational region. Bias
currents distribute correctly, with acceptable distortion through switching events. For the eSR-based shift
register, 16 output bits are immediately observed after starting the clock. In both cases, the input pattern
is reproduced at the output. Lower bias current requirements are evident for the magnetic flux-biased
shift register.
Fig. 2.4.6. eDES - eSFQ deserializer cell with magnetic flux bias. Schematic of eDES cell (a); a
deserializer configuration (b); the operation of eDES (c). Flux bias is ramped up from 0.1 ns to 0.2
ns, as evident in the bottom trace. Initially, eDES is non-storing, but when a pulse arrives at Din, it is
stored, readable by either a Shift or a Read pulse. Simulated 4-bit operation of the deserializer at 20
GHz (d). Clearly, the input signal (a count from 0 to 15) is parallelized, resulting in 4 output streams.
Circuit parameters for (a): Inductances: L1s: 2.9 pH, L2s: 2.0 pH, L3s: 1.4 pH, L4s: 1.2 pH, L5s: 0.2
pH, L6s: 0.9 pH, L7s: 4.2 pH, L8: 2.0 pH, LS1: 3.1 pH, Lbs: 10 pH, Lf: 16.1 pH, k: 0.24, L1r: 1.7 pH,
L2r: 1.7 pH, L3r: 1.6 pH, L4r: 2.5 pH, L5r: 0.2 pH, L6r: 0.6 pH, L7r: 3.4 pH, Lbr: 10 pH. Critical
currents: Jc1s: 163 µA, Jc2s: 188 µA, J1s: 288 µA, Jd2s: 188 µA, Jdes: 200 µA, Jd1: 150 µA, Jbs:
500 µA, Jc1r: 188 µA, Jc2r: 188 µA, J1r: 225 µA, Jd2r: 188 µA, Jder: 138 µA, Jbr: 500 µA. All
junctions are critically shunted.

RSFQ deserializers (demultiplexers) generally follow one of two approaches: a binary tree or a
shift-and-dump architecture. For conversion to eSFQ, we chose the latter approach as it has
found  more  applications  in  practical  circuits  due  to  its  high  modularity  and  simple  timing.  Our 
eSFQ  deserializer  is  based  on  a  dual-port  D  flip-flop  or  D  -cell,  which  is  a  derivation  of  the  B 
flip-flop. One port is intended for serial shifting of data, the other for parallel readout. An N-bit
deserializer divides a serial stream of bits into N parallel streams.
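
The shift-and-dump principle can be sketched at the logic level as follows. This is a behavioural illustration only: the function name, the bit ordering within a word, and the choice to empty the cells on readout (mirroring the destructive readout of the cell) are assumptions of the sketch, not a description of the circuit timing.

```python
def shift_and_dump(serial_bits, n):
    """Split a serial bit stream into n-bit parallel words.

    n shift-clock events load n bits into a chain of n cells; one read event
    then dumps all cells in parallel and leaves them empty. Trailing bits that
    do not fill a complete word are never read out.
    """
    words = []
    cells = [0] * n
    for i, bit in enumerate(serial_bits):
        cells = [bit] + cells[:-1]        # shift: newest bit enters cell 0
        if (i + 1) % n == 0:              # read pulse after every n shifts
            words.append(cells[::-1])     # report the oldest bit first
            cells = [0] * n               # destructive readout
    return words

# Hypothetical stimulus: a count from 0 to 15, sent MSB first as 4-bit words.
serial = [int(b) for value in range(16) for b in format(value, "04b")]
words = shift_and_dump(serial, 4)
print(words[5])   # the word carrying the value 5
```

Each position within the returned words corresponds to one parallel output stream, so a 4-bit deserializer produces four streams running at a quarter of the shift-clock rate.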

The  designed  eSFQ  deserializer  cell,  eDES,  is  depicted  in  Fig.  2.4.6.  The  two  readout  ports  are 
topologically  symmetrical,  both  achieving  destructive  readout  of  stored  flux.  Note  the  additional 
escape  junction  in  each  readout  arm  (  ).  The  deserializer  cell  contains  two  DMPs, 

suggesting  two  bias  injection  points.  As  in  MeSR,  a  flux  bias  is  employed  in  the  data  section  to 
achieve  the  desired  bias  current  distribution  between  .  All  junctions  were  designed  to 

be  critically  shunted.  When  correctly  biased,  eDES  stores  a  “0”  after  ramp-up. 

There  are  essentially  two  clocks  that  thread  each  deserializer  cell.  The  symmetry  of  the  cell  and 
size  of  the  limiting  junction  means  that  the  per-bit  switching  energy  required  by  the  shift 
operation  as  well  as  the  read  operation  is  comparable  to  that  of  the  MeSR-based  shift  register. 
For  normal  operation  the  ratio  of  the  clock  frequencies  depends  on  the  length  of  the  deserialiser. 
A  per-bit  switching  energy  is  thus  not  meaningfully  ascribed  to  the  deserializer  cell,  but  for  long 
deserializers  the  per-bit  switching  energy  of  the  deserializer  approaches  that  of  the  MeSR-based 
shift  register. 

Experimental  Evaluation 

In  order  to  investigate  eSFQ  logic  experimentally,  the  eSR,  MeSR  and  eDES  cells  were  laid  out 
and  their  circuits  were  reoptimized  to  account  for  the  extracted  layout  parasitics.  Several  eSFQ 
shift  registers  with  16-  and  32-bit  length,  as  well  as  deserializers  with  4-,  8-  and  16-bit  lengths 
were  assembled.  Fig.  2.4.7  shows  examples  of  the  experimental  eSFQ  chips  designed  for 
fabrication  using  the  HYPRES  Niobium  superconductor  integrated  circuit  fabrication  process. 
To  investigate  the  performance  of  the  designed  eSFQ  circuits  in  a  variety  of  environments,  12 
test structures were laid out across five 5x5 mm² chips for fabrication with a 4.5 kA/cm²
Josephson junction critical current density.

Fig. 2.4.8 shows examples of the fabricated eSR and MeSR circuits. As is evident, they differ in the
shunt resistor of the escape junction J1, which provides overdamping for the eSR design and critical
damping for the MeSR design. For MeSR, the magnetic flux bias was implemented as a
superconducting line under the cell storage inductors to induce magnetically the required phase
shift. Microphotographs of the deserializer are depicted in Fig. 2.4.9. In order to protect circuits
from flux trapping, ground plane moats were employed as well as ground plane holes covering
unused chip areas.

To  concentrate  design  effort  on  the  eSFQ  demonstrator  cells  and  to  minimize  the  probability  of 
failure  in  the  periphery  circuits,  existing  conventional  RSFQ  cells  from  the  HYPRES  cell  library 
were  employed  as  a  testbed.  These  comprise  standard  interfaces  to  room-temperature  circuitry, 
such as dc/SFQ and toggle-type SFQ/dc converters. Fig. 2.4.9(b) shows an RSFQ testbed made
of  these  standard  library  RSFQ  cells.  The  test  chips  also  contain  standard  diagnostic  circuits  for 
fabrication  process  control  visible  in  Fig.  2.4.7. 
Fig. 2.4.7. Layouts of two of the five 5x5 mm² ICs fabricated with HYPRES's 4.5 kA/cm²
process, one with three shift registers (a), and one with a 16-bit deserializer (b). Each bias line
is fed from multiple contact pads, which enables experimental investigation of different bias
fan-in configurations.

Besides establishing functional correctness of the designed eSFQ cells, the objective was to
investigate experimentally the effects of the bias current distribution in the superconducting biasing
network at the initial bias current ramp-up. Since eSFQ circuits do not have resistors in the bias
network, the bias distribution relies on the interplay between the specific inductances of the bias lines
and the gate current-limiting junctions. This is not easy to simulate as the circuit initialization is
inherently a slower process than its SFQ operation. For this reason, several versions of shift
registers with different width (specific inductance) of bias distribution buses and different current
injection fan-in were designed.
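
As a first-order illustration of why the bias-line inductances matter, the sketch below computes how a slowly ramped current divides among purely inductive parallel branches: each branch sees the same voltage during ramp-up, so the currents are inversely proportional to the branch inductances. This deliberately ignores the Josephson inductance of the limiting junctions, mutual coupling, and the ladder structure of a real bias line, all of which affect the actual distribution. The inductance values are hypothetical.

```python
def inductive_current_division(total_current, inductances):
    """Divide a ramped current among parallel, purely inductive branches.

    All branches share the same voltage L_k * dI_k/dt during ramp-up, so the
    final branch currents are proportional to 1 / L_k.
    """
    if any(l <= 0 for l in inductances):
        raise ValueError("inductances must be positive")
    inverse = [1.0 / l for l in inductances]
    total_inverse = sum(inverse)
    return [total_current * x / total_inverse for x in inverse]

# Hypothetical example: 1 mA fed into three branches of 10 pH, 10 pH and 20 pH.
currents = inductive_current_division(1.0e-3, [10e-12, 10e-12, 20e-12])
print([f"{c * 1e6:.0f} uA" for c in currents])
```

In an eSFQ bias network the limiting junctions additionally cap the current that any one cell can accept, which this linear model does not capture.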

Thus,  in  addition  to  laying  out  different  combinations  of  the  basic  designed  cells,  these  were 
placed  in  a  variety  of  different  bias  lines.  The  designed  bias  lines  are  characterized  by  the  cell-to- 
cell  inductance  of  the  line,  ,  as  well  as  the  line-to-cell  limiting  inductance,  .  Bias  lines  of 
two  different  widths  (narrow:  pH,  wide:  pH)  were  laid  out  and  combined  with 

three  lengths  of  limiting  inductor  (short:  pH,  medium:  pH,  long:  pH). 

For each laid-out structure, bias fan-in was conservatively high. Each structure has its own bias
line, which is shared by all cells in the structure (deserializer structures each have two bias lines,
one for the read and one for the shift operation). One bias pin was allocated to every four cells
in a structure. This enables comprehensive investigation of different bias-current fan-in
configurations (biasing from all available pins or biasing from one pin only, for example).
Fig.  2.4.7  depicts  two  examples  of  chip  layout,  illustrating  the  high  number  of  bias-current  pins. 
To  further  the  breadth  of  this  investigation,  one  structure  was  equipped  with  a  bias-current 
divider  that  binds  the  four  bias  line  entry  points  of  the  16-bit  shift  register  to  a  single  pin. 


Fig. 2.4.8. Layouts of eSFQ shift register cells: eSR with overdamped J1 (a) and MeSR with
critically damped J1 and a flux bias line inductively coupled to the cell storage inductor (b).
Labelled features: limiting junction, storage inductor, ground plane moats, flux bias line; scale bars: 80 µm. Cell
sizes (indicated by red boundary): eSR: 80x110 µm², MeSR: 80x105 µm². These dimensions
do not include the bias line.
Fig. 2.4.9. Photographs of deserializer test structures: the deserializer base cell eDES (a),
and a corresponding 4-bit eSFQ deserializer (b). Bias lines and some key devices are
indicated. A set of dc/SFQ and SFQ/dc converters enables the interface with room-temperature
electronics.


Experimental evaluation was performed with test patterns applied and responses measured with
the Octopux system. Each chip was tested in a liquid helium dewar using HYPRES standard
cryoprobes. Correct operation of the shift register structures was established by feeding in a bit
pattern and verifying its transmission with the correct delay (in terms of clock events). Correct N-
bit deserializer operation was established by feeding in a pattern of length N with the shift clock,
applying a read pulse, and then verifying the parallel readout against the input pattern. This
process was repeated several times to verify deserializer operation. Figs. 2.4.10-11 depict
examples of the measured correct test patterns of the 16-bit eSFQ shift registers and
deserializers. For exhaustive testing, not only uniform clock and data patterns were employed,
but also randomly generated ones.
Fig. 2.4.10. Measured correct functionality of 16-bit eSFQ shift registers: simple pattern
(a), randomly generated data and clock pattern (b). All 9 shift registers were fully
operational.
Fig. 2.4.11. Measured correct functionality of 16-bit deserializer for different input
patterns. This circuit divides the input serial stream into 16 output parallel streams.
Random data and shift clock pulses are used for testing. Note that measured voltages are
seeded automatically.


To determine the bias margins of the investigated structures, random 200-bit test patterns were
applied to the devices under test for various bias currents. Various bias-current feeding
configurations were investigated, most notably biasing from one pin only (repeated for each
available pin), and biasing from all pins simultaneously. Only the largest identified continuous
region of operation was considered. All tested structures passed functional testing for all tested
patterns. The measured results of the bias current margin investigation are listed in Table 1.
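
The rule of keeping only the largest continuous region of operation can be sketched as a small post-processing step on a bias sweep. The sketch assumes a uniformly spaced sweep with one pass/fail result per bias point and measures "largest" by the number of consecutive passing points; the function names and the sweep data are hypothetical and do not reproduce measured values.

```python
def largest_operating_region(bias_currents, passed):
    """Return (low, high) bias of the longest run of consecutive passing points.

    Returns None if no bias point passed. Ties are resolved in favour of the
    first run found.
    """
    best = None    # (first_index, last_index) of the longest passing run
    start = None
    for i, ok in enumerate(list(passed) + [False]):   # sentinel closes a final run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    if best is None:
        return None
    return bias_currents[best[0]], bias_currents[best[1]]

def margin_percent(low, high):
    """Symmetric margin in percent around the centre of the operating region."""
    centre = 0.5 * (low + high)
    return 100.0 * (high - low) / (2.0 * centre)

# Hypothetical sweep in mA with isolated passing points outside the main region.
bias = [4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5]
ok = [True, False, True, True, True, True, False, True]
region = largest_operating_region(bias, ok)
print(region, f"+/-{margin_percent(*region):.1f}%")
```

Isolated passing points outside the main region are ignored, which matches the conservative way the margins are reported here.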

Table  1.  Experimentally  determined  bias  margins  for  eSFQ  test  structures  across  five  chips:  a 
comprehensive  set  of  devices  in  different  bias  configurations,  measured  to  establish  functional 
correctness  of  the  devices  and  attempt  to  identify  desirable  traits  of  the  bias  line  layout. 


Kind          Len  Bias line  Limiting   Comment    Bias margins [mA]
                   width      inductor              (All pins / Single pin)
Shift Reg     32   Narrow     Medium
Shift Reg     16   Narrow     Medium     Bias Tree  -
Shift Reg     32   Wide       Medium
Shift Reg     16   Narrow     Short
Shift Reg     16   Narrow     Medium
Shift Reg     16   Narrow     Long
Shift Reg     32   Narrow     Medium     Flux Bias  Not tested
Shift Reg     16   Narrow     Medium     Flux Bias  Not tested
Shift Reg     16   Wide       Medium     Flux Bias  Not tested
Deserialiser  16   Narrow     Medium     Flux Bias  (Clk), (Rd); Not tested
Deserialiser  8    Narrow     Medium     Flux Bias  (Clk), (Rd); Not tested
Deserialiser  4    Narrow     Medium     Flux Bias  (Clk), (Rd); Not tested

The  measured  bias  margins  roughly  conform  to  expectations  extrapolated  from  simulated  results. 
Simulations  relied  on  (small)  4-bit  configurations  to  reduce  computation  time  to  design-friendly 
speeds  and  were  performed  in  high-speed  testbeds,  not  reflecting  the  actual  devices  under  test  or 
their  periphery.  The  observed  agreement  with  simulations  confirmed  that  the  designed  structures 
are  robust  and  scale  well. 

As Fig. 2.4.12 indicates, the MeSR-based shift registers (with magnetic flux bias) functioned
only when the magnetic bias was applied, which corresponds to our simulations. Bias margins do
not  seem  dependent  on  the  length  of  the  shift  register  structure,  although  the  dataset  is  too  small 
to  identify  definite  trends. 
Fig. 2.4.12. Measured bias margins vs. magnetic bias for different MeSR-based shift registers.


Fig. 2.4.13. Measured bias margins vs. supplied magnetic bias current of 8-bit and 16-bit
deserializer test structures.

The deserializer structures under test worked without the magnetic flux bias applied. As
demonstrated  in  Fig.  2.4.13,  applying  the  magnetic  flux  bias  improves  their  bias  margins, 
doubling  them  in  the  case  of  the  16-bit  deserializer.  For  the  16-bit  deserializer,  the  clock  and  read 
bias  margins  exhibit  comparable  absolute  values  and  dependence  on  the  magnetic  flux  bias, 
which  is  consistent  with  simulations.  The  8-bit  structure  does  not  mirror  this  symmetry,  which 
may  indicate  higher  susceptibility  to  the  periphery. 

Comparatively  low  bias  margins  were  recorded  for  the  16-bit  deserializer.  We  attribute  this  to 
the relative complexity of the test structure and to an ill-designed monitor setup. The high fan-out
of the deserializer requires a large number of SFQ/dc monitors, as well as several dc/SFQ
converters  to  apply  the  test  patterns.  Due  to  the  high  pin  count  of  the  test  structure  (two  bias 
lines,  parallel  output),  all  peripheral  cells  were  biased  from  a  single  pin,  which  yielded  low 
margins  for  the  peripheral  bias,  potentially  depressing  the  margins  of  the  device  under  test. 

As  for  designing  optimal  bias  line  inductances  and  limiting  inductors,  the  experimental  test 
structures  did  not  yield  conclusive  trends.  We  conclude  from  this  that  these  inductances,  at  least 
when  constrained  to  the  parameter  ranges  investigated,  do  not  significantly  affect  bias  current 
distribution,  which  is  a  positive  result.  Promising  are  the  high  bias  margins  measured  for  the  16- 
bit  shift  register  biased  from  a  single  pin  feeding  a  bias  tree.  This  suggests  that  the  off-chip 
biasing  effort  for  eSFQ  systems  should  be  comparable  to  systems  based  on  conventional  RSFQ. 

Conclusions 

We  have  demonstrated  for  the  first  time  eSFQ  digital  circuits  -  a  new  ultra-low  power  RSFQ- 
type  logic  capable  of  achieving  a  significant  increase  in  energy  efficiency  of  computing  systems. 
The  demonstrated  eSFQ  shift  registers  reached  ~0.8  aJ/bit.  This  number  includes  the  integrated 
SFQ  clock  lines.  The  achieved  energy  per  bit  is  over  two  orders  of  magnitude  better  than  that  of 
the  same  circuits  implemented  in  conventional  RSFQ  logic. 

Similar  to  ERSFQ,  another  energy  efficient  RSFQ-type  logic,  eSFQ  relies  on  limiting  Josephson 
junctions  to  distribute  the  dc  bias  to  logic  gates.  In  contrast  to  ERSFQ,  the  limiting  junctions  do 
not  switch  during  circuit  operation  and  are  needed  only  for  the  initial  bias  current  ramp-up.  This 
is  achieved  by  the  bias  current  injection  via  two-junction  decision  making  pairs  (DMPs)  which 
have  equal  phases  during  the  gate  operation  independent  of  digital  data.  This  also  allows  eSFQ 
circuits  to  operate  without  large  bias  inductances  otherwise  needed  to  minimize  data  dependent 
bias  current  fluctuations.  As  a  result,  the  eSFQ  circuit  layouts  are  more  dense  and  easier  to  scale. 

However,  we  found  that  the  injection  of  bias  current  via  DMPs  depresses  parameter  margins.  In 
this  work,  we  explored  ways  to  rectify  this  effect  by  using  either  stronger  junction  damping 
(slowing  down  the  DMP  escape  junction)  or  passive  phase  shifters.  The  first  method  is  simpler 
in  the  implementation  but  leads  to  a  reduction  of  the  maximum  speed  of  operation.  The  second 
method does not limit the maximum clock frequency but requires the introduction of extra phase-
shifting elements such as a flux bias line or π-junctions.

Passive  phase  shifters  would  bring  an  additional,  perhaps  even  more  significant  result:  a 
reduction of the required dc bias for eSFQ gates. We demonstrated that phase shifters allow a
~20% gate bias reduction, which directly translates to a corresponding ~20% reduction of the
gate dynamic power dissipation. We believe that this number can be further improved with
targeted circuit optimization. For example, the critical currents used in these circuits are in
a 180-300 μA range, which is larger than required by thermal noise at 4 K.

In contrast to a simple flux biasing line implemented in this work, the introduction of phase
shifting π-junctions would require the incorporation of ferromagnetic materials into a
conventional  superconducting  fabrication  process.  This  might  look  cumbersome  and  expensive 
at  first,  but  the  work  on  the  superconductor-ferromagnetic  fabrication  process  is  already 
happening.  It  is  motivated  by  the  recent  efforts  in  superconducting  magnetic  memory 
developments  and  research  in  superconducting  spintronics.  We  expect  that  superconductor- 
ferromagnetic  phase  shifters  will  be  preferable  for  eSFQ  circuits. 

The  demonstrated  eSFQ  shift  register  design  is  quite  compact  comprising  five  junctions  per  bit 
excluding  the  passive  bias-limiting  junction.  The  total  number  of  junctions  for  a  16-bit  shift 
register  is  80.  This  compares  favourably  to  an  RQL  shift  register  with  8  junctions  per  bit. 

While  achieving  a  significant  improvement  in  power  dissipation,  the  demonstrated  eSFQ  logic 
retains  all  key  advantages  of  conventional  RSFQ  logic:  high  speed,  high  throughput,  dc  bias, 
controllable and programmable SFQ clock, and lossless interconnects. As opposed to ac biasing
and a global clock, the eSFQ dc bias and locally controllable SFQ clock are particularly
advantageous for scaling up integrated circuit complexity to millions of junctions. Simple eSFQ
layout requirements, without the need for transformers and microwave plumbing, also bode well
in terms of scaling up the circuit density. Finally, the ability to control the SFQ clock distribution
allows management of eSFQ circuit power dissipation. This is the prerequisite for the
development  of  energy  proportional  processors  -  the  ultimate  goal  of  computing  system 
developers. 

Superconducting systems require cryocooling. The efficiency of 4 K cryocoolers ranges from
~10,000 W/W for small cryocoolers (heat capacity of < 1 W) to < 400 W/W for large machines
(heat capacity of 600-900 W). For high-end computing systems, the larger cryocoolers are
relevant. For example, for the Linde LR280 with 360 W/W efficiency, this puts the eSFQ circuit
demonstrated in this work at ~290 aJ/bit. This is close to the projected bit energy for future
CMOS gates; however, one should realize that the biggest energy loss (>~pJ/bit) in CMOS
circuits  is  in  data  movement.  In  contrast,  superconducting  SFQ  circuits  including  eSFQ  can  use 
ballistic data transport at a similar sub-aJ per bit level as for logic and register circuits. There
is also a potential advantage of superconducting SFQ circuits over room-temperature
competition in power density, which has constrained the progress towards faster CMOS circuits.
Assuming that eSFQ circuits can be scaled to the CMOS circuit densities, power density will be
at least three orders of magnitude lower, as the cryocooling penalty does not change this
difference. The practical advantage of this much lower power density requires further study to
account for comparative heat removal capabilities and operation temperature ranges. This will be
done  in  the  next  project  period. 
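
The cryocooling penalty discussed above can be reproduced with a short calculation. The sketch
below assumes that the wall-plug energy per bit is simply the on-chip energy per bit multiplied
by the cryocooler specific power (watts of input power per watt of heat removed at 4 K). The
function and variable names are illustrative and are not part of the original work.

```python
def wall_plug_energy_aj(chip_energy_aj: float, cooler_w_per_w: float) -> float:
    """Return the room-temperature energy per bit in aJ, including cooling overhead.

    Assumption: the overhead is a pure multiplicative factor (W of input power
    per W of heat lifted at 4 K), as implied by the W/W figures in the text.
    """
    return chip_energy_aj * cooler_w_per_w


ESFQ_ENERGY_AJ = 0.8  # measured eSFQ shift-register energy per bit quoted in the text

# Cryocooler specific powers quoted in the text (W/W)
COOLERS_W_PER_W = {
    "small cryocooler (~10,000 W/W)": 10_000.0,
    "large machine (upper bound of 400 W/W)": 400.0,
    "Linde LR280 (360 W/W)": 360.0,
}

for name, w_per_w in COOLERS_W_PER_W.items():
    energy = wall_plug_energy_aj(ESFQ_ENERGY_AJ, w_per_w)
    print(f"{name}: {energy:.0f} aJ/bit")
```

For the 360 W/W case this gives 288 aJ/bit, consistent with the rounded ~290 aJ/bit figure.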

III. Energy-Efficient Interface based on Ultra-Low Input Voltage
polarization  modulating  VCSELs 

Scope 

The overall goal of this multi-phase project is to develop and demonstrate the energy-efficient
data  output  interface  between  cryogenic  4  K  modules  and  room-temperature  systems  using  a 
combination  of  energy-efficient: 

•  low-voltage  drivers  based  on  energy  efficient  SFQ  logic; 

•  lossless,  low  dispersion  superconductor  cables; 

•  low-voltage  electro-optical  links  based  on  polarization  modulating  (PM)  VCSELs. 

The objective is to achieve a record energy efficiency of <0.2 pJ/bit at 12 Gb/s. The projected data
rate is targeted at 12.5 Gb/s and then further extended to higher data rates (~25 Gb/s).

The current work focuses on the development of a critical component technology - energy-efficient,
high-speed, very low input voltage (~1 mV) Polarization Modulation (PM) VCSELs.

Project funding for this project period (Phase 2b): $184K.

The project team consists of HYPRES and the group of Prof. K. Choquette from the University of Illinois
(Urbana). This report covers progress from Sep. 1, 2012 to Aug. 30, 2013.

The  Energy-efficient  Cryogenic  Optical  (ECO)  data  link  design  is  based  on  balancing  power 
dissipation  and  signal  gain  at  each  temperature  stage  to  maximize  overall  energy  efficiency 
following  the  recently  introduced  Thermo-Gain  Rule.  To  achieve  VCSEL  light  emission  with 
two  switchable  distinct  polarization  modes,  a  cruciform-shaped  anisotropic  optical  cavity  is 
formed  by  fabrication  of  a  photonic  crystal  with  etched  periodic  air  holes  surrounding  the 
unetched  cruciform  region.  In  this  report,  we  present  the  results  of  design,  fabrication,  and 
preliminary testing of the ECO data link components.

Introduction 

The lack of an energy-efficient, high-bandwidth, and scalable technology for data links from
cryogenic low-power superconducting circuits to room-temperature higher-power semiconductor
circuits has been a serious impediment to the application of superconducting electronics, including
digital-RF receivers, instrumentation, high-performance computing, network switches, sensor
systems, etc. [1]-[5]. There have been multiple attempts to address this problem [6]-[10] using
all-electrical or electro-optical (E/O) links. The E/O data links are generally preferred for
communication in high-performance computing systems. For exascale systems, the link energy
efficiency should be on the order of 2 pJ/bit [11]. The use of an optical fiber to carry digital data
has an additional benefit for cryogenic systems due to its negligible heat leak and signal
attenuation. In this paper, we will focus on the development of key components for the energy
efficient cryogenic E/O links capable of transmitting data at >20 Gbps and ~1 pJ/bit.

Technical  Approach:  Cryogenic  Output  Data  Link 

Fig.  3.1  shows  a  diagram  of  our  energy-efficient  output  digital  link  from  4  K  superconducting 
modules  to  room  temperature  electronics.  The  SFQ  digital  data  is  converted  to  ~1  mV  voltage 
pulses using low-power, compact superconducting drivers. These relatively weak pulses are
transported to a ~70 K stage using low-loss, low-dispersion, low-heat-leak electrical cables. At
this stage, the data is converted to the optical domain and transmitted to room temperature via
low-loss, low-dispersion, low-heat-leak optical fiber. To determine the total link efficiency, the
power dissipation at each temperature stage has to include the cryocooler (closed cycle
refrigerator, CCR) efficiency. These efficiencies vary depending on CCR size [12], [13] as
shown in Fig. 3.1. For some CCRs, warmer stages of the CCR have limited excess cooling capacity
(a flat load curve), providing an opportunity to cool circuits without an increase of overall power
[13]. To maximize the link energy efficiency F_EE, we distribute amplification of the
transmitted signal (and, therefore, dissipated power) over different temperature stages following
the Thermo-Gain Rule (TGR) [14]: 1/F_EE ~ G_1/T_1 + G_2/T_2 + G_3/T_3, where G_i is the gain of an
amplifier located at temperature T_i, e.g., T_i = 4 K, 70 K, 300 K. The amplification gain at each
temperature stage should be kept to a minimum - just enough to provide a low-bit-error data
transmission to a warmer stage over the connecting cable. The less signal degradation in the
cable, the less signal gain is required at the colder stage.
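
The effect of the Thermo-Gain Rule can be illustrated numerically. The sketch below evaluates
the sum of gain over temperature for each stage, which the rule says is proportional to the
inverse of the link energy efficiency. The two gain distributions are hypothetical values chosen
only to show the trend, and the function name is an assumption made for this illustration.

```python
from typing import Sequence


def tgr_cost(gains: Sequence[float], temps_k: Sequence[float]) -> float:
    """Return sum(G_i / T_i), the quantity that 1/F_EE scales with under the TGR.

    A lower value corresponds to a higher link energy efficiency.
    """
    if len(gains) != len(temps_k):
        raise ValueError("one gain per temperature stage is required")
    return sum(g / t for g, t in zip(gains, temps_k))


STAGES_K = (4.0, 70.0, 300.0)  # stage temperatures named in the text

# Two hypothetical ways of splitting the same overall gain (product = 1000)
cold_heavy = (100.0, 10.0, 1.0)   # most of the gain placed at 4 K
warm_heavy = (1.0, 10.0, 100.0)   # most of the gain placed at 300 K

print(f"cold-heavy cost: {tgr_cost(cold_heavy, STAGES_K):.3f}")
print(f"warm-heavy cost: {tgr_cost(warm_heavy, STAGES_K):.3f}")
```

Under these assumed numbers, placing the gain at the warm stages gives a much smaller cost,
which matches the guidance to keep the gain at each cold stage to the minimum needed for
low-bit-error transmission.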


Fig. 3.1. ECO Data Link concept - an energy-efficient data interface to transport high-speed digital data
from cryogenic to room-temperature stages. Signal amplification is distributed over different temperature
stages of a closed cycle refrigerator (CCR).

Superconductor SFQ electronics operate with SFQ pulses of E_J = Φ_0 I_c ~ 10^-19 J. The low signal
energy is the major advantage of SFQ electronics. At the same time, this poses a significant
challenge for getting output signals to room-temperature electronics or optics. To amplify these
signals, a number of superconducting amplifiers (drivers) were developed [6]-[9], [15]-[17]. To
minimize the power dissipation, we use the smallest and fastest driver: the toggle flip-flop (TFF)
SFQ/dc converter. To increase its energy efficiency, the circuit dc bias was redesigned using a
combination of ERSFQ-style biasing [18] for the TFF circuit, and the dc bias for the output
junction pair applied via the output data line. This new scheme was successfully verified
experimentally. These drivers are also the most compact compared to other known
superconducting drivers, which is an important feature for high-density circuits.
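
As a rough check of the ~10^-19 J pulse energy scale, the sketch below assumes that the energy
of one SFQ switching event is the product of the magnetic flux quantum and the junction critical
current. The critical current values are illustrative and are not taken from the driver design.

```python
PHI_0_WB = 2.067833848e-15  # magnetic flux quantum h/(2e), in webers


def sfq_switching_energy_j(critical_current_a: float) -> float:
    """Return the approximate energy of one SFQ switching event, Phi_0 * I_c, in joules."""
    return PHI_0_WB * critical_current_a


# Illustrative critical currents in microamperes (assumed values)
for ic_ua in (50, 100, 300):
    energy_j = sfq_switching_energy_j(ic_ua * 1e-6)
    print(f"I_c = {ic_ua} uA -> E = {energy_j:.2e} J")
```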

The output data cable connecting the 4 K drivers to the 70 K circuits has to prevent a heat leak,
provide minimum losses and dispersion for the mV-level signal transmission, and avoid Joule
heating associated with dc biasing via output lines. All these requirements can be met by HTS
cables. The HTS cables have already been proven for dc biasing [19]. To provide proper
impedance  for  high-bandwidth  data  transmission,  HTS  bit-lines  can  be  formed  as  coplanar  lines. 
To  achieve  lower  crosstalk  and  denser  line  placement,  a  two-layer  HTS  microstrip  multi-bit 
cable  is  being  developed  [20]. 

The  choice  of  E/O  component  is  guided  by  a  significant  signal  level  difference  between  the 
~mV-level output signals of superconducting drivers and the volt-level signals typically required to
drive existing E/O devices. The lowest driving voltage E/O modulators available today require
~400 mV [21] with prospects to achieve sub-100 mV in the future. However, even this goal is
hardly  acceptable,  since  it  would  require  the  use  of  power-  and  area-consuming  intermediate 
amplifiers  to  amplify  sub-mV  signals. 

As  an  alternative  to  E/O  modulators,  vertical  cavity  surface  emitting  lasers  (VCSELs)  can  be 
used [22]-[28]. These laser diodes require low operation power, allow high-density integration,
and can be optimized for operation at cryogenic temperatures. However, the amplitude of the
VCSEL output is typically modulated using a few mA variation of the laser drive current
(corresponding to a >100 mV voltage swing). Therefore, the conventional approach of modulating
intensity  of  the  laser  output  is  not  advantageous  compared  to  E/O  modulators.  In  order  to  obtain 
a  significant  reduction  in  driving  signal  to  the  millivolt  levels,  we  propose  to  modulate 
polarization  of  VCSEL  output  to  transmit  digital  data  as  shown  in  Fig.  3.2. 

In  a  polarization-modulation  (PM)  VCSEL,  both  the  transverse  cavity  geometry  and  current 
injection  are  designed  to  be  anisotropic  to  fix  the  VCSEL  emission  polarization  [22],  [23],  which 
otherwise  would  be  random  [24].  In  our  cruciform  VCSEL  design,  two  orthogonal  polarizations 
are  allowed  due  to  the  effect  of  index-guiding  and  loss  confinement  in  the  laser  diode,  shown  in 
Fig.  3.2.  By  injecting  current  into  one  of  the  two  perpendicular  arms,  the  dominant  polarization 
direction  can  be  selected  (Fig.  3.2).  In  contrast  to  VCSEL  direct  intensity  modulation,  such 
polarization  switching  should  only  require  a  small  driving  signal,  and  the  VCSEL  emission 
intensity  will  remain  relatively  constant. 

The  transverse  optical  aperture  is  defined  by  a  cross-shaped  defect  in  a  square-lattice  pattern  of 
air  holes.  The  lattice  of  holes  creates  a  photonic  crystal  which  can  be  used  to  control  the  optical 
confinement  [25].  Injection  anisotropy  is  achieved  by  the  bias  applied  to  the  two  orthogonally 
positioned  pairs  of  metal  contacts  [22]  across  a  cruciform  current  aperture  surrounded  by  high- 
energy  proton  implantation. 

We  demonstrate  experimentally  that  the  dominant  polarization  of  PM  VCSEL  emission  follows 
the  direction  of  current  injection,  as  indicated  in  Fig.  3.3.  A  polarizer  is  inserted  in  front  of  the 
photodetector. The light output of the VCSEL operated at room temperature is measured to
determine the polarization of emission. Fig. 3.3 shows the polarization-resolved continuous wave
laser outputs when the horizontal (threshold at 12 mA) or vertical (threshold at 17 mA) arm of
the cruciform VCSEL is injected with current [22]. By injection into either arm of the laser
cavity at 17 mA, the direction of the laser polarization can be selected with the direction of
current injection with nearly 10 dB selectivity.

Fig. 3.2. (a) Modulation of the VCSEL light output. The polarization of the output light is
varied between the two linear orthogonal states as controlled by the separate electrical contacts. (b)
Scanning electron microscope image of a PM VCSEL showing square-lattice photonic crystal defects
and orthogonally positioned top metal contacts to control VCSEL output light polarization.

VCSELs have been characterized at cryogenic temperatures [26], [27]. It has been found that the
threshold current, output power, and laser emission characteristics can all be optimized for
cryogenic operation by engineering the epitaxial design of a VCSEL. In particular, by creating an
offset between the maximum wavelength of the laser gain produced by the quantum wells as
compared to the allowed VCSEL emission from the cavity resonance, the performance of the
VCSEL can be tuned for a particular operation temperature. For cryogenic VCSELs, their optical
performance will be reduced at room temperature, but found to be optimal at their design
temperature. This implies that custom compound semiconductor epitaxial materials are required
for optimum cryogenic VCSELs. We have been optimizing VCSEL performance by fabricating
samples  with  a  dielectric  top  distributed  Bragg  reflector  (DBR)  [28].  In  the  following  section,  we 
provide  details  of  this  work. 

Development  of  Polarization  Modulating  VCSELs 

This  section  describes  the  research  activity  at  the  University  of  Illinois  by  Prof.  Kent  Choquette 
and his students during the period Sep. 1, 2012 to Aug. 30, 2013. The specific objective is to
explore low power modulation of the light output from a VCSEL emitting at 850 nm. The Phase
II activity began approximately March 1, 2012 and is scheduled to end approximately Sept. 30,
2013. This research includes the following tasks: (1) design and fabricate revised
photolithographic masks for photonic crystal ion implanted VCSELs for polarization modulation
using “half VCSEL” epitaxial materials; (2) fabricate prototype VCSEL devices using epitaxial
VCSEL material and characterize devices; (3) fabricate prototype polarization modulation
VCSEL devices using epitaxial VCSEL material; (4) characterize at room temperature the laser
performance, including polarized light versus current, emission spectrum, and polarization
properties; and (5) provide recommendations and assistance to HYPRES staff for packaging
VCSELs for cryogenic measurements. The following describes in detail our results to date.

Revised  Mask  Design 

Starting  in  the  Fall  of  2012,  revised  mask  designs  were  made  in  order  to  fabricate  VCSELs  with 
a top distributed Bragg reflector (DBR) mirror composed of dielectric materials (Fig. 3.3). Hence,
the semiconductor epitaxial materials correspond to a “half VCSEL” composed of a
semiconductor bottom DBR and an active region, capped with a contact layer. On top of this
wafer, after device patterning and fabrication, a top DBR is deposited. The advantage of this
structure is to avoid current transport through the top DBR, which was found in Phase I to create
isotropic, rather than directional, current injection. Two mask levels were made: an ion
implantation mask (used to define the current path), which is smaller than previously used, and a
mask to define the extent of the top DBR mirror over the cruciform cavity (see Fig. 3.4).




Fig.  3.3.  Side  view  sketch  of  polarization  modulation  VCSEL  with  top  dielectric  DBR  mirror. 




Fig.  3.4.  Sketch  of  example  device  from  the  revised  masks:  smaller  implant  (green  layer)  and 
top  DBR  mirror  pattern  (light  purple). 

Prototype dielectric DBR VCSEL Fabrication and Characterization

Using epitaxial material with a lower semiconductor DBR and active region (i.e., "half VCSEL"),
round VCSELs were fabricated to confirm the viability of the material and to provide HYPRES
with samples for low temperature testing. Shown in Fig. 3.5(a) is a side view sketch showing the
top contact under the dielectric DBR, while Fig. 3.5(b) shows an optical microscope view of the
fabricated devices. Note that in Fig. 3.5(b) the adjacent metal contact pads have an extra thickness of
3 μm of Au to enable wire bonding.


Fig.  3.5.  (a)  Side  view  sketch  of  VCSEL  with  top  dielectric  DBR  mirror;  and  (b)  optical  image. 

Fig.  3.6  is  a  plot  of  the  reflectivity  at  normal  incidence  for  the  top  dielectric  DBR.  The  mirror 
consists of 6 periods of TiO2/SiO2. Initial attempts to deposit these layers were carried out at the
University of Illinois, but it was ultimately decided to utilize a commercial vendor (KLabs in
New Jersey). Shown in Fig. 3.7 are the light and voltage versus injection current for the
fabricated dielectric DBR VCSELs. Note the slope change in the voltage curves at lasing
threshold (e.g., approximately 3 mA in Fig. 3.7(a)). The finite voltage drop observed at currents
less than threshold current arises from current leakage in these devices. This in turn arises from
the deficiency in the current isolation in the VCSEL, which will be rectified in the polarization
modulation VCSELs with the revised implantation mask. Device die of dielectric DBR VCSELs
with various aperture sizes were provided to HYPRES on April 30, 2012. Specifically,
approximately 10 die of 5, 10, and 15 μm aperture VCSELs were provided, where each die contained
3 rows of 10 identical VCSELs. These samples were subsequently tested by HYPRES at cryogenic
temperatures  to  determine  their  operating  characteristics  and  to  verify  their  viability  at  low 
temperature. 

Fig. 3.6: Reflectance spectra of dielectric DBR deposited by KLabs showing high reflectivity at
980 nm.

Fig. 3.7: Light and voltage versus injection current for dielectric DBR VCSEL with: (a) 5 μm;
and (b) 15 μm aperture.

Fig. 3.8: Polarization modulation VCSELs before top mirror deposition: (a) optical image of
cruciform device showing electroluminescence in one of the arms of the cruciform cavity; (b)
SEM image showing photonic crystal polarization modulation VCSEL; and (c) SEM image of a
unit cell of devices.

Fabrication  of  Dielectric  DBR  Polarization  Modulation  VCSELs 

Starting  in  May  2012,  the  fabrication  process  for  polarization  modulation  VCSELs  began.  In  the 
Appendix  we  show  the  process  follower  for  this  fabrication  run.  Because  of  the  new  device 
structure  (top  dielectric  DBR),  several  steps  required  process  development.  These  included  the 
ion  implantation,  the  polyimide  planarization,  and  the  dielectric  DBR  patterning.  Fig.  3.8  shows 
optical  images  and  scanning  electron  microscope  images  of  the  polarization  modulation  VCSELs 
before  deposition  of  the  top  DBR  mirror. 

Preliminary  testing  of  these  devices  before  deposition  of  the  top  dielectric  DBR  mirror  shows 
diode current-voltage characteristics, and clear electroluminescence is observed. As shown in
Fig. 3.8(a), spontaneous emission with current injection is evident in one of the arms of the
cruciform device. Two samples were sent for dielectric DBR mirror deposition. At the present
time these devices are undergoing their final processing steps, and laser emission testing and
characterization will begin in September.

Cryogenic Test Setup Development

To characterize the fabricated PM VCSELs at low temperatures and to demonstrate an entire
energy-efficient interface, we developed a complete experimental setup to characterize VCSELs
at temperatures ranging from a cryogenic temperature of 70 K to room temperature (300 K). We
constructed the test setup based on the available two-stage cryocooler, a Sumitomo Heavy
Industries (SHI) SRDK-101DP-11C. Earlier in this program, we modified the cryopackage in order
to adapt it for VCSEL samples and to provide the temperature control necessary to study the
temperature dependence of their performance. During this phase of the program, we specifically
performed the following:

•  Developed and assembled a complete experimental setup to test VCSELs at cryogenic
temperatures.

•  Developed a LabView© software program for automated data acquisition and thorough
characterization of VCSELs at cryogenic temperatures.

•  Mounted VCSELs with different aperture sizes of 5, 10 and 15 μm on the cryostage of the
cryocooler. This cryostage has active temperature control, allowing operation temperatures
to be set in the range from ~70 K to ~200 K for VCSEL samples mounted on the first
temperature stage. We note that superconducting chips with ERSFQ/eSFQ circuits will
be installed at the 2nd (4 K) stage of this cryocooler. Both stages are supplied with their
own calibrated thermometers.

•  Verified correct operation of the complete experimental setup by characterizing VCSELs at
room temperature (300 K) and at cryogenic temperatures in the range from 70 K to 200 K.
VCSEL current-voltage and current-power characteristics have been measured.

We assembled an experimental setup comprising the cryocooler with a modified cryostage to
accommodate VCSELs, a temperature controller, an optometer, a current source, and a voltmeter, all
governed by a PC for data acquisition and control, to evaluate VCSELs at room and cryogenic
temperatures. The complete setup is shown in Fig. 3.9a. Fig. 3.9b shows the arrangement inside the
cryocooler vacuum chamber, with the chip with multiple VCSELs (see Fig. 3.10) facing down toward an
optical detector connected by a low-noise coaxial cable to an optometer.

Before testing began, the chip containing the VCSEL samples was secured to an OFHC mount
and PCB using a conductive aluminum paste, as shown in Fig. 3.10. Once the chip is
glued, each VCSEL sample to be tested is wire bonded to the PCB. In order to supply current
from the room-temperature current source, flat connectors are soldered and connected by wires
to room temperature for measurements.

Fig. 3.9. Complete experimental setup to characterize VCSELs from cryo- to room temperature.

Fig. 3.10. Chip with VCSEL samples glued to OFHC mount and wire bonded to PCB.

All experimental instruments were connected via a GPIB interface to a PC. Data collection itself
was done using a custom program created using LabView©. The software (screenshot shown in
Fig. 3.11) requires the user to choose a starting current, a step current, and a final current so that
the current source can run a DC current sweep for a single VCSEL while measuring voltage and
optical responses. By taking advantage of the GPIB network, the instruments are able to respond
to one another and take readings in synced intervals, as directed by the LabView© program.
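
The sweep logic described above (start current, step current, final current, with voltage and
optical power read at each point) can be sketched in a few lines. The example below is written
in Python rather than LabView and is not the acquisition program used in this work. The
instrument access is replaced by a simulated device, and all class, function, and parameter
names are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple


def current_steps(start_ma: float, step_ma: float, stop_ma: float) -> List[float]:
    """Return sweep points from start_ma up to stop_ma (inclusive) in steps of step_ma."""
    if step_ma <= 0:
        raise ValueError("step_ma must be positive")
    # Small epsilon guards against floating-point rounding just below an integer
    n = int(math.floor((stop_ma - start_ma) / step_ma + 1e-9))
    return [start_ma + i * step_ma for i in range(n + 1)]


def run_liv_sweep(
    points_ma: Sequence[float],
    set_current_ma: Callable[[float], None],
    read_voltage_v: Callable[[], float],
    read_power_mw: Callable[[], float],
) -> List[Tuple[float, float, float]]:
    """Set each current, then read voltage and optical power; return (I, V, P) rows."""
    rows = []
    for i_ma in points_ma:
        set_current_ma(i_ma)
        rows.append((i_ma, read_voltage_v(), read_power_mw()))
    return rows


class SimulatedVcsel:
    """Stand-in for the current source, voltmeter and optometer (toy diode model)."""

    def __init__(self, threshold_ma: float = 3.0, slope_mw_per_ma: float = 0.2,
                 turn_on_v: float = 1.5, series_ohm: float = 50.0) -> None:
        self.threshold_ma = threshold_ma
        self.slope_mw_per_ma = slope_mw_per_ma
        self.turn_on_v = turn_on_v
        self.series_ohm = series_ohm
        self.current_ma = 0.0

    def set_current_ma(self, i_ma: float) -> None:
        self.current_ma = i_ma

    def read_voltage_v(self) -> float:
        if self.current_ma == 0.0:
            return 0.0
        return self.turn_on_v + self.series_ohm * self.current_ma * 1e-3

    def read_power_mw(self) -> float:
        # No light below threshold, linear increase above it
        return max(0.0, self.slope_mw_per_ma * (self.current_ma - self.threshold_ma))


device = SimulatedVcsel()
sweep = run_liv_sweep(current_steps(0.0, 0.5, 10.0), device.set_current_ma,
                      device.read_voltage_v, device.read_power_mw)
for i_ma, v, p_mw in sweep[::4]:
    print(f"I = {i_ma:4.1f} mA  V = {v:5.3f} V  P = {p_mw:5.3f} mW")
```

Passing the instrument operations in as callables keeps the sweep loop independent of how the
instruments are actually addressed, so the same loop could in principle be driven by real
hardware bindings.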




HYPRES  Final  Report  W911NF-09-C-0036  (proposal  #55336PHQC) 


Fig. 3.11. Screenshot of LabView data acquisition program.


To verify our experimental setup and LabView data acquisition program, we measured three
conventional VCSEL samples with different apertures of 5, 10 and 15 μm. We measured the same
set of current-voltage and current-power characteristics for all VCSEL samples. Fig. 3.12a
shows typical current-voltage and current-power characteristics for a particular 15 μm
sample (d_Yl) taken at 300 K. The VCSEL current-voltage characteristic did not change much
with temperature: slight changes in voltage were caused by a change in lead resistance. Fig. 3.12b
presents current-power dependencies for different VCSEL temperatures, from cryogenic to
room temperature, showing an increase of the lasing threshold as the temperature is lowered.
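
One common way to read a lasing threshold off a current-power curve is to fit a straight line
to the above-threshold region and take its intercept with the current axis. The sketch below
shows this approach on synthetic data. It is not necessarily the procedure used for the
measurements reported here, and the function name, the cut-off fraction, and the data are
assumptions.

```python
from typing import Sequence


def lasing_threshold_ma(currents_ma: Sequence[float], powers_mw: Sequence[float],
                        min_fraction: float = 0.2) -> float:
    """Estimate the threshold current as the x-intercept of a least-squares line.

    Only points whose power is at least min_fraction of the maximum power are
    used, so that the sub-threshold region does not bias the fit.
    """
    p_max = max(powers_mw)
    pts = [(i, p) for i, p in zip(currents_ma, powers_mw) if p >= min_fraction * p_max]
    if len(pts) < 2:
        raise ValueError("need at least two above-threshold points")
    n = len(pts)
    mean_i = sum(i for i, _ in pts) / n
    mean_p = sum(p for _, p in pts) / n
    sxx = sum((i - mean_i) ** 2 for i, _ in pts)
    sxy = sum((i - mean_i) * (p - mean_p) for i, p in pts)
    if sxx == 0 or sxy <= 0:
        raise ValueError("no rising linear region found")
    slope = sxy / sxx
    # The line P = slope * (I - I_th) crosses zero at I_th
    return mean_i - mean_p / slope


# Synthetic current-power data: 3 mA threshold, 0.2 mW/mA slope (illustrative only)
currents = [float(i) for i in range(11)]
powers = [max(0.0, 0.2 * (i - 3.0)) for i in currents]
print(f"estimated threshold: {lasing_threshold_ma(currents, powers):.2f} mA")
```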

Fig. 3.12. (a) Current-voltage and current-power characteristics for d_Yl 15 μm VCSEL measured at
300 K. (b) Current-power characteristics for d_Yl 15 μm VCSEL measured at different temperatures.

Development of Energy-Efficient on-chip Drivers

In the project, we worked on the lowest power, fastest and smallest area drivers - SFQ-to-dc (or
SFQ/dc) converters - to modulate VCSEL polarization. In order to increase energy efficiency, we
have converted this SFQ-to-dc converter to ERSFQ logic. The standard RSFQ SFQ/dc converter
consists of a flip-flop and an output junction pair generating an output voltage depending on the
state of the flip-flop. Typically, a toggle flip-flop (TFF) is used. While the conversion of the TFF
from standard RSFQ to ERSFQ is relatively straightforward, biasing for the output junction pair is
not obvious, as its output voltage should exceed the voltage on a clock line. During the previous
project periods, we invented a solution to this problem by feeding the output pair of the SFQ/dc
converter off the VCSEL device via the connecting data cable (see Fig. 3.13). The test circuits were
designed for the HYPRES conventional 4.5 kA/cm2 fabrication process.

In  this  project  period,  we  fabricated  the  ERSFQ  SFQ/dc  converter  with  output  stage  biasing 
enabled via the output data line from the VCSEL. Fig. 3.14 shows a micrograph of an integrated
circuit fragment with the designed test circuit consisting of a dc/SFQ converter, a Josephson
Transmission Line (JTL), and the ERSFQ output SFQ/dc converter. The dc/SFQ converter and JTL are
made in standard RSFQ logic to reduce risk, while the SFQ/dc converter is designed using the ERSFQ
logic approach. Fig. 3.15 shows the results of experimental evaluation demonstrating the correct
operation of this circuit.


[Fig. 3.13 diagram labels: inductor + JJ bias; SFQ/dc converter; 2nd stage (4 K); 1st stage (45-75 K)]

Fig. 3.13. A new ERSFQ SFQ/dc converter with output junction pair fed via output data cable
connecting SFQ/dc converter and VCSEL.

Fig. 3.14. Micrograph of the fabricated test circuits with a new ERSFQ on-chip driver (SFQ/dc
converter) integrated with input circuitry (dc/SFQ converter and Josephson Transmission Line
(JTL)) for the driver operation and verification.


[Fig. 3.15 oscilloscope traces. Input: 100 kHz, 500 mV/div. Output: 1 mV/div.]

Fig. 3.15. Correct operation of ERSFQ SFQ/dc converter with output stage biasing
implemented via output data line.

Superconducting Ferromagnetic Random Access Memory Development

This additional effort comprises two tasks in the development of 4 K
Superconducting Ferromagnetic MRAM circuits compatible with Josephson junction digital
energy  efficient  SFQ  circuits.  Task  1  focuses  on  the  development  of  a  scalable,  energy-efficient 
memory  element  based  on  Magnetic  Josephson  junctions  (MJJs)  with  either  one 
(superconductor-insulator-superconductor-ferromagnetic-superconductor  (SIsFS)  MJJ)  or  two 
(superconductor-insulator-superconductor-ferromagnetic-superconductor-ferromagnetic- 
superconductor  (SIsFsFS)  MJJ)  ferromagnetic  layers.  For  SIsFS  MJJ,  we  demonstrated  earlier 
the  memory  properties  of  two  memory  states  with  different  critical  current  values  and  high  IcRn 
comparable  to  that  of  conventional  SIS  Josephson  junctions.  Task  2  is  devoted  to  the 
demonstration  of  a  superconducting  ferromagnetic  transistor  (SFT)  -  a  three-terminal  device 
with  good  input/output  isolation  for  integration  with  MJJ-based  memory  cell,  and  capable  of 
perfonning  the  memory  cell  selector  function  in  random  access  memory  arrays. 


Summary  of  the  project  accomplishments 

In  this  project,  we  have  achieved  the  following  results: 

1. Successful demonstration of the SIsFS MJJ memory element with reduced dimensions of
2 µm x 2 µm, as compared to the previously demonstrated 10 µm x 10 µm MJJ.

a. Fabricated a set of 2 x 2 µm² SIsFS MJJs with a Nb-Al/AlOx-Nb-PdFe-Nb
structure using the HYPRES/InQubit co-fabrication approach.

b. By applying external magnetic field pulses and biasing the MJJ with the appropriate
Read current, both Write and Read operations showing switching between “0” and
“1” states for the SIsFS MJJ were demonstrated.

2. Development of the SIsFsFS MJJ memory element scalable to sub-µm dimensions.

a. As a first step here, we fabricated a set of 10 x 10 µm² SFsFS MJJs with a Nb-PdFe-
Nb-PdFe-Nb structure.

b. Demonstrated a 2.8% difference in magneto-resistive measurements between the
parallel (P) and anti-parallel (AP) orientations of the ferromagnetic layers.

3. Development of a memory cell selector - a superconductor-ferromagnetic transistor (SFT)
with high input-output isolation.

a. Fabrication of single- and double-acceptor SFIFSIS superconducting ferromagnetic
transistors.

b. Demonstration of critical current modulation for the SFT acceptor by applying
current through the SFT injector for the single-acceptor SFTs. This is required for
SFT input/output isolation to realize an MJJ memory cell with the integrated SFT cell
selector.

c. Demonstration of a voltage gain above 25 and perfect input/output isolation for the
double-acceptor SFT.

Our proposed memory cell designs take advantage of the distinct and unique characteristic of
our MJJs - a fast switching time comparable to that of conventional SIS JJs employed in low-
power fast SFQ circuits. This feature allows us to avoid integrating SIS JJ based readout SQUID
elements in the memory cells, which would restrict the memory cell scalability and the overall RAM
array density. In order to achieve the fast switching time of MJJs, we combine a fast SIS
junction with a magnetic SFS junction in a single tunnel structure [1-4].

This leads to the ability to utilize MJJs as fast-switching Josephson junctions with programmable
critical current - a new feature in superconducting electronics not available in the past. Such
MJJs can be directly used for memory cells and programmable logic in the most area-efficient
way, without area-consuming SQUIDs and/or additional SIS junctions. The memory cell can now
be read out by directly switching this MJJ electrically. As a result, the memory cell area will be
determined by the scalable MJJ rather than by auxiliary SIS JJs, leading to a very dense memory
array - one of the prime motivations for the cryogenic MRAM development. Even a relatively
large MJJ (e.g., 2 x 2 µm²) can be quite competitive, since it does not require any additional
SQUIDs.

In contrast, the known alternative approaches of using superconducting-ferromagnetic junctions
as programmable pi-shifters lead to the necessity to integrate them with additional SIS junctions
to form SQUIDs, etc. Such memory cells have to be read out by switching additional SIS
junctions, since these 0-pi phase shifters are too slow (MHz-range) to be read out directly. As a
result, the memory cell area will be determined not by the scalable MJJ area but by the SQUID
area made of resistively shunted SIS junctions and geometrically large inductors, which
compromises the memory cell density.

Our fast MJJs are to be employed in two possible memory cell designs, which we developed at
no cost to this project as HYPRES’ internal R&D effort. These designs are the subject of a patent
application. Both memory cell designs allow energy-efficient, small-area, and fast memory
cells suitable for dense, scalable MRAM applicable to cache and main memory, and even naturally
extendible to multi-port register files. The first design is a single-MJJ cell with a ballistic SFQ
readout (SFQ-MJJ memory cell). The second design follows a typical MRAM memory cell
configuration, in which a single MJJ storage element is combined with a 3-terminal SFT
cell selector (SFT-MJJ memory cell).

In  this  project,  we  focus  on  two  highest-priority  tasks: 

•  Scalable (below 1 µm) magnetic Josephson junction with two ferromagnetic layers for
use as the storage element of a memory cell.

•  Superconducting-ferromagnetic  transistor  (SFT)  with  good  input/output  isolation  for  the 
use  as  a  memory  cell  selector. 

Fabrication  and  characterization  of  scalable  MJJ  memory  element  for 
cryogenic  memory 

The specific objective of the HYPRES/InQubit group effort is to explore the scalability of MJJ
memory elements, starting from the already demonstrated 10 x 10 µm² memory elements. The
goals of this project are to optimize the MJJ design and fabrication procedure to realize MJJs of
minimal (optimally sub-µm) dimensions while retaining the main properties (high IcRn
product, ability to write in and retain information, etc.) required for application as a memory
element. At the beginning of this project, we planned to optimize the MJJ memory element with a
single ferromagnetic layer (SIsFS) and scale it to ~ µm dimensions. To scale the MJJ memory
element further to sub-µm dimensions, we introduce additional sF layers. This progression is
depicted in Fig. 5.1. The thickness of the additional ‘s’ layer should ensure that its superconductivity
is assisted by the opposite (AP) orientation of the magnetization vectors of the F films and is
significantly (completely) suppressed at their parallel (P) orientation (see Fig. 5.2). In the first case, with
opposite (AP) orientation of the F layer magnetizations, the SIsFsFS device can be decomposed into a
series connection of an SIs tunnel junction with sFs and sFS sandwiches, with a higher Ic for the whole
structure. In the second case, with co-aligned (P) magnetization of the F layers, the s layer between the
two F layers transitions into the normal state, significantly decreasing the structure's Ic.


Fig. 5.1. MJJ memory element progression (SIsFS to SIsFsFS).

Fig. 5.2. SIsFsFS memory element in AP and P states.


This  report  covers  the  research  activity  for  the  period  from  September  13,  2013  to  August  30, 
2014.  The  subtasks  were: 

•  Sub-Task 1.1: Fabrication of SIsFS MJJs with 2 x 2 µm² dimensions using the earlier
established HYPRES/InQubit co-fabrication approach (details follow).

•  Sub-Task 1.2: Characterization (I-V curves, Ic(H), switching curves, etc.) of the MJJs
produced in sub-task 1.1 for application as a cryogenic memory element.

•  Sub-Task 1.3: Selection of the optimal parameters, e.g., layer thicknesses for the two
ferromagnetic layers and the superconducting layer between them, for the SIsFsFS memory
element.

•  Sub-Task 1.4: Fabrication and characterization of SFsFS MJJs produced with the layer
thicknesses selected in sub-task 1.3.

2.1  Fabrication  of  SIsFS  MJJs 

Our fabrication process is based on the established and proven HYPRES/InQubit co-fabrication
approach. The fabrication was split into two major steps. First, we produced a series of 150-mm
wafers with in-situ deposited Nb-Al/AlOx-Nb trilayers with a 4.5 kA/cm² target Josephson
critical current density. The wafers were then diced into 15 x 15 mm samples and transferred to
InQubit/ISSP for subsequent deposition of a ferromagnetic layer (Pd0.99Fe0.01) and a top Nb
counter electrode. The resultant structure is of the SIsFS type. For the SIs fabrication, the
thickness of the counter electrode was 20 nm, whereas the Nb base electrode thickness was 120
nm. At the InQubit facility, the samples were cleaned in acetone/methanol/IPA and blow-dried
with N2 gas. In-situ Ar sputter etching was used to remove about 10 nm of the Nb oxide layer before
depositing the ferromagnetic layer. The PdFe/Nb bilayer was deposited using rf- and dc-magnetron
sputtering. The PdFe layer was 25 nm thick, thin enough to avoid significant critical current
suppression. The top Nb layer thickness was about 150 nm to ensure uniform supercurrent flow
through the Josephson junction. Then, we formed a square mesa of 2 x 2 µm² size by
photolithography, reactive ion etching (RIE) of the top Nb layer, and argon plasma etching of the
PdFe and Al/AlOx layers, and patterned the bottom Nb electrode with photolithography and RIE.
At the third step, we formed an isolation layer with a contact (wiring) window by using thermal
evaporation of SiO and a lift-off process. The junction contact size in the SiO layer was 4 x 4 µm². At the
last step, we formed a Nb wiring electrode of 450 nm thickness using magnetron sputtering and
lift-off. We used argon RF etching to ensure good interface transparency between the wiring
and the top Nb electrode of the mesa. The resultant structure is shown in Fig. 5.3.


[Cross-section labels: Nb wiring; base Nb; substrate]

Fig. 5.3. The MJJ structure produced using the HYPRES/InQubit MJJ co-fabrication process.


2.2 Characterization of 2 x 2 µm² SIsFS MJJ samples

All SIsFS measurements were performed in a variable-temperature liquid He cryostat. The
sample holder was placed in a vacuum can with He gas added for better heat exchange. The
sample temperature was controlled using a carbon thermometer and a compact heater wound
from a twisted pair and glued close to the sample. Fig. 5.4 shows the measured current-voltage
characteristic (CVC) of the SIsFS MJJ at a temperature of 4.2 K. The unshunted MJJ is susceptible to
external noise and premature switching to the voltage state. To measure the MJJ critical current Ic,
post-process shunting by a 0.01 Ω resistor made of Al wire was performed, and the CVC shown in
Fig. 5.4 was obtained. With a critical current density of 4.5 kA/cm², the expected Ic for an MJJ with
dimensions of 2 x 2 µm² is about 180 µA. The critical current Ic = 60 µA at 4.2 K shown in Fig. 5.4
can be explained both by degradation of the quality of Nb in the intermediate s-layer due to excessive Ar
ion cleaning and by suppression of superconductivity in the MJJ by the ferromagnetic layer. As expected, the Ic
of the MJJ under investigation rose as the junction temperature was decreased, reaching Ic = 165
µA at T = 2.1 K (see Fig. 5.5).
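
As a quick check of the estimate above, the expected critical current is the target critical current density multiplied by the junction area. The short Python sketch below reproduces the value of about 180 microamperes for a 2 x 2 micrometer junction and compares it with the two reported measurements. The function and variable names are illustrative and are not part of the original work.

```python
def expected_critical_current(jc_a_per_cm2: float, side_um: float) -> float:
    """Return Ic in amperes for a square junction with the given side in micrometers."""
    side_cm = side_um * 1e-4  # 1 um = 1e-4 cm
    return jc_a_per_cm2 * side_cm ** 2

# Values taken from the text: 4.5 kA/cm^2 target density, 2 um x 2 um mesa.
ic_expected = expected_critical_current(4.5e3, 2.0)
ic_measured_4p2k = 60e-6   # A, reported at 4.2 K
ic_measured_2p1k = 165e-6  # A, reported at 2.1 K

print(f"expected Ic = {ic_expected * 1e6:.0f} uA")
print(f"measured/expected at 4.2 K = {ic_measured_4p2k / ic_expected:.2f}")
print(f"measured/expected at 2.1 K = {ic_measured_2p1k / ic_expected:.2f}")
```

The ratios make the size of the suppression at 4.2 K explicit: roughly one third of the expected value, recovering to about nine tenths at 2.1 K.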

The switching experiment was performed for a 2 x 2 µm² unshunted MJJ device at temperature T
= 3.4 K, with the I-V curve presented in Fig. 5.6. Application of a small external magnetic field
changes the magnetization of the ferromagnetic layer, which in turn changes the junction Ic,
allowing the realization of two distinct states with high and low Ic, corresponding to logical “0”
and “1” states, respectively. Thus, one can choose a junction bias current (Iread = 75 µA) to
switch the SIsFS MJJ from a superconducting to a resistive state by a pulse of magnetic field.
This experiment is presented in Fig. 5.7, where positive and negative magnetic field pulses
switch the SIsFS junction from a superconducting (zero-resistance) or “0” state to a resistive
or “1” state and back.
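
The read operation described above can be illustrated with a minimal threshold model: the junction stays superconducting if its critical current exceeds the read bias and switches to the resistive state otherwise. The sketch below is an assumption-laden illustration, not the measurement procedure. Only the read bias of 75 microamperes comes from the text; the two critical current values (90 and 60 microamperes) are hypothetical, since the report does not quote them at 3.4 K.

```python
I_READ_UA = 75.0  # read bias current from the text, in microamperes

# Hypothetical critical currents for the two magnetization states (assumed values).
IC_HIGH_UA = 90.0  # assumed high-Ic state, logical "0"
IC_LOW_UA = 60.0   # assumed low-Ic state, logical "1"

def write_state(bit: int) -> float:
    """Model a Write field pulse as selecting one of the two assumed Ic values."""
    return IC_LOW_UA if bit else IC_HIGH_UA

def read_state(ic_ua: float, i_read_ua: float = I_READ_UA) -> int:
    """Return 0 if the junction stays superconducting at the read bias, else 1."""
    return 0 if ic_ua > i_read_ua else 1

for bit in (0, 1, 0):
    ic = write_state(bit)
    print(f"wrote {bit}: Ic = {ic:.0f} uA, read back {read_state(ic)}")
```

The model only works when the read bias lies between the two critical currents, which is the condition the choice of Iread has to satisfy.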


Fig. 5.4. Current-voltage characteristic (CVC) of a 2 x 2 µm² SIsFS MJJ externally shunted with a
0.03 Ω resistor, taken at T = 4.2 K.

Fig. 5.5. CVC of a 2 x 2 µm² SIsFS MJJ externally shunted with a 0.03 Ω resistor, taken at T = 2.1 K.

Fig. 5.6. I-V characteristic (CVC) of a 2 x 2 µm² unshunted SIsFS MJJ taken at T = 3.4 K.

Fig. 5.7. Confirmation of Write/Read operation: switching of the SIsFS MJJ between “0” and “1”
states by remagnetization with an external magnetic field. V(t) - average junction voltage, H(t) -
applied magnetic field.

2.3 Selection of parameters for SIsFsFS MJJ memory element for cryogenic memory

2.3.1  Thickness  of  the  ‘s’  interlayer  between  two  ‘F’  layers 

First, we have to find the optimal thickness of the Nb interlayer between the two ferromagnetic
Pd0.99Fe0.01 layers. This Nb interlayer has to be thin enough for the whole structure to operate as a
single MJJ. Additionally, it has to be thick enough for its superconducting properties not to be
completely suppressed by the proximity effect from the neighboring ferromagnetic layers.

We fabricated and studied an (FsF) PdFe-Nb-PdFe trilayer with PdFe layer thicknesses of 20 nm
and 25 nm. We observed that with a Nb interlayer of 15 nm this trilayer has a critical
temperature of Tc = 2.62 K (see Fig. 5.8), which proves that in this case the superconducting properties
of Nb are strongly affected by the proximity effect from the ferromagnetic PdFe. For a trilayer with a Nb
interlayer thickness of 10 nm, Tc < 1 K was measured, making it impractical. A larger thickness of the
Nb interlayer is required.


Fig. 5.8. Superconducting transition for the PdFe-Nb-PdFe (20 nm-15 nm-25 nm) trilayer affected by
the proximity effect.

2.3.2  Thicknesses  of  ferromagnetic  ‘F’  layers 

As a next step, we determine the thicknesses of the ferromagnetic layers needed to realize AP and P
states with higher and lower Ic, respectively. We first try to realize AP and P states in an MJJ with two
F-layers of Pd0.99Fe0.01 of different thickness. We study Ic(H) magnetization curves for SIsFS (Nb-
Al/AlOx-Nb-PdFe-Nb) MJJs co-fabricated by HYPRES/InQubit/ISSP as described in section 2.1.

Figs. 5.9-5.12 show hysteretic dependences of the critical current Ic on magnetic field for SIsFS MJJs
with a 20 nm ferromagnetic layer (Figs. 5.9 and 5.11) and a 25 nm ferromagnetic layer (Figs. 5.10 and
5.12). Figs. 5.9-5.12 demonstrate a typical Fraunhofer-like Ic(H) dependence of the MJJ critical current
on the external magnetic field H, aligned parallel to the layers within the MJJ and supplied by an
external solenoid. For each measured point of the Ic(H) curve, the current through the sample is
swept from a subcritical value upwards until the threshold voltage is exceeded while the external
magnetic field is fixed. For all Ic(H) dependencies, the magnetic field sweep directions are shown by
arrows in Fig. 5.9.

From Fig. 5.9, one can conclude that the coercive field for the 20 nm F-layer at temperature T = 2.5
K is Hc ~ -2.2 Oe. The coercive field of the MJJ is defined as the magnetic field at which both the magnetic
induction B and the magnetic flux Φ of the MJJ are zero. Consequently, the critical current of the
MJJ should be at its maximum at the coercive field. Similarly, from Fig. 5.10, one can conclude
that the coercive field for the 25 nm F-layer at temperature T = 2.5 K is Hc ~ -3.35 Oe.

Fig. 5.9. Ic(H) dependence for Nb-Al/AlOx-Nb-PdFe-Nb SIsFS MJJ with a 20 nm ferromagnetic
layer.

Fig. 5.10. Ic(H) dependence for Nb-Al/AlOx-Nb-PdFe-Nb SIsFS MJJ with a 25 nm ferromagnetic
layer.
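
The statement that Ic peaks at the coercive field can be illustrated with an ideal Fraunhofer pattern whose argument is shifted by the F-layer magnetization. The sketch below is an idealized model and not a fit to the measured curves: the peak current `IC0_UA` and the field period `H_PERIOD_OE` are assumed values chosen for illustration, and only the shift of -2.2 Oe is taken from the text.

```python
import math

def fraunhofer_ic(h_oe: float, ic0_ua: float, h_shift_oe: float, h_period_oe: float) -> float:
    """Ideal Fraunhofer pattern Ic0*|sin(x)/x| with the zero-flux point at h_shift_oe."""
    x = math.pi * (h_oe - h_shift_oe) / h_period_oe
    if abs(x) < 1e-12:
        return ic0_ua  # limit of sin(x)/x at x -> 0
    return ic0_ua * abs(math.sin(x) / x)

IC0_UA = 100.0       # assumed peak critical current
H_PERIOD_OE = 8.0    # assumed field per flux quantum
H_SHIFT_OE = -2.2    # coercive field quoted for the 20 nm F-layer

# Scan the applied field from -10 Oe to +10 Oe in 0.05 Oe steps.
fields = [i * 0.05 for i in range(-200, 201)]
h_peak = max(fields, key=lambda h: fraunhofer_ic(h, IC0_UA, H_SHIFT_OE, H_PERIOD_OE))
print(f"Ic is maximal at H = {h_peak:.2f} Oe")
```

In this picture the hysteresis seen in the measurements corresponds to the shift changing sign or magnitude as the F-layer is remagnetized, which the single fixed shift used here does not model.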


The hysteretic Ic(H) dependence in Fig. 5.11 shows that, for full remagnetization, the MJJ with 20 nm
PdFe has to be placed in an external magnetic field higher than 5.2 Oe. At the same time,
Fig. 5.12 shows that the fully magnetized MJJ with 25 nm PdFe remains magnetized at zero
field. Fig. 5.12 was measured by sweeping the external magnetic field from a very large positive H ~ 25
Oe (ensuring full PdFe magnetization) to H ~ -5.2 Oe and then back to H ~ 25 Oe. For the red
Ic(H) curve, at ~ -2.2 Oe zero magnetic field, Ic is suppressed, proving that the 25 nm PdFe still
remains magnetized.


Fig. 5.11. Ic(H) dependence for SIsFS MJJ with a 20 nm ferromagnetic layer.

Fig. 5.12. Ic(H) dependence for SIsFS MJJ with a 25 nm ferromagnetic layer.


Therefore, the opposite magnetization of the two F-layers can be realized in an MJJ with 20 nm and 25
nm PdFe layers. Moreover, to make this magnetization difference even more pronounced, we
study an MJJ with 20 nm and 30 nm PdFe layers.


2.4 Fabrication and evaluation of SFsFS test MJJs

As a first step towards the development of fast MJJs based on SIsFsFS, we fabricated and
tested simpler SFsFS MJJs with Nb-PdFe-Nb-PdFe-Nb 10 µm x 10 µm structures and respective
layer thicknesses 120-30-15-20-450 nanometers. The Nb-PdFe-Nb-PdFe-Nb (120-30-15-20-150
nm) multilayer was formed by rf- and dc-magnetron sputtering, and a 10 x 10 µm² mesa was formed by:

1. Reactive ion etching (RIE) of the top Nb layer,

2. Forming the barrier geometry by RIE of the PdFe layers and the Nb interlayer,

3. Patterning the bottom Nb electrode with photolithography and RIE,

4. Forming an isolation layer with a contact (wiring) window by thermal evaporation of SiO
and a lift-off process,

5. Forming a Nb wiring electrode of 450 nm thickness using magnetron sputtering and
lift-off.

Fig. 5.13 presents Ic(T) dependencies for the SFsFS MJJ with three normalized maxima shown, i.e.,
the main Ic(H) peak in the demagnetized state as well as the first and second negative Ic(H) peaks
in the magnetized state. As the MJJ is cooled, all curves in Fig. 5.13 show a crossover at T ~
2.25 K with a drastic increase in Ic, which reflects the s-layer between the F-layers becoming
superconducting. Fig. 5.13 does not show any stimulation of Ic by the external
magnetic field: all peaks (main and secondary) follow the same temperature dependence.
This might indicate that the two PdFe layers acted as a single one. This assumption is confirmed by
the fact that the coercive field (Hc) in the Ic(H) dependences corresponds to a single ~50 nm PdFe layer.

Still, we were able to distinguish the P and AP states at the lowest temperature available in our
measurement system (see Fig. 5.14). Fig. 5.14 shows normalized CVCs for the SFsFS MJJ in the P
(solid black) and AP (solid red) states, with a measured magneto-resistive difference of 2.8%
between P and AP. Since this is not yet sufficient for memory application at T =
4.2 K, further material research to optimize this double ferromagnetic layer memory element
is required. This may include a different composition of PdFe (i.e., an increase of the Fe content from
the current 1%) and/or different thicknesses of the PdFe layers.


Fig. 5.13. Ic(T) dependence for SFsFS MJJ.


Fig. 5.14. SFsFS MJJ magneto-resistive measurements at T = 1.2 K.

2.  Fabrication and characterization of a Superconductor-Ferromagnetic
Transistor (SFT) for MJJ memory cell with integrated SFT cell selector

The specific objective of the HYPRES/Northwestern University (NWU) team effort is to develop
a superconducting ferromagnetic SFIFSIS three-terminal device (Fig. 5.15) with good
input/output isolation as a memory cell selector integrated with an MJJ. The goal of this task is to
prove that the SFT has the transistor-like properties necessary for the intended use as a memory cell
selector. These are: (i) an ability to modulate the critical current between two device terminals by
injecting current from the third terminal; (ii) good input/output isolation, eliminating a potential
half-select issue. This device will be based on the SFIFSIS structures, which have already
shown good isolation properties [6].

This  final  report  covers  the  research  activity  from  September  13,  2013  to  August  30,  2014.  The 
primary  activities  were: 

•  Sub-Task  2.1:  Fabricate  a  set  of  SFIFSIS  SFT  devices  with  the  goal  to  increase 
input/output  isolation  and  demonstrate  the  modulation  of  the  acceptor  critical  current  by 
input  current  applied  to  the  device  injector  terminal. 

•  Sub-Task  2.2:  Experimentally  evaluate  the  fabricated  SFIFSIS  SFT  devices  to  investigate 
the  levels  of  critical  current  modulation  and  input/output  isolation  in  the  regime  applicable 
for  memory  readout  process. 

3.1  Fabrication  of  SFIFSIS  superconductor-ferromagnetic  transistor 

In  this  project  period,  the  focus  of  our  work  was  on  fabrication  of  superconductor-ferromagnet 
three-terminal  devices  and  testing  their  characteristics  at  4.2  K.  The  devices  were  patterned  from 
SFIFSIS  multilayers,  where  S  (superconductor)  stands  for  Nb;  F  (ferromagnetic  material)  stands 
for Ni; and I (insulator) stands for Al/AlOx. For patterning of these devices, NWU used new
photomasks  provided  earlier  by  HYPRES  at  no  cost  to  this  project.  The  goal  was,  first,  to  make 
the  contact  pads  of  the  same  configuration  as  HYPRES  typically  uses  in  order  to  standardize  the 
chips  and  make  their  characterization  compatible  with  HYPRES,  and  second,  to  reduce  the  size 
of  devices. 

Here we follow the technique for fabrication of multi-terminal SFIFSIS devices consisting of
two stacked junctions developed earlier at NWU. The three-terminal devices were fabricated
from Nb(120)/Ni(4)/Al/AlOx(9)/Ni(4)/Nb(30)/Al/AlOx(9)/Nb(80) multilayers, where the numbers in
parentheses indicate the nominal thicknesses of the respective layers in nm (for Al/AlOx it is the
initial thickness of unoxidized Al). Anodization and deposition of SiO2 were used for proper
insulation of the electrodes from each other. The structure of a three-terminal SFIFSIS device,
together with the experimental technique, i.e., the way bias currents are applied and voltage
is measured, is shown in Fig. 5.16. Fig. 5.17 shows a top-view microphotograph of the actual
single- and double-acceptor devices.

We succeeded in fabricating single-acceptor and double-acceptor three-terminal
superconductor-ferromagnet transistors (SFTs). In the latter case, two acceptor SIS junctions are
placed on top of the same injector SFIFS junction (see the schematic side view of the device
structure in Fig. 5.16). The devices were made on sapphire substrates. If the two acceptor
junctions are measured in series, as shown in Fig. 5.16, then the output voltage is doubled as
compared with that in an ordinary single-acceptor device similar to that described in [5]. This
configuration might be favorable if the device is used in the voltage amplifier mode [6], which is
useful for investigation of the SFT properties.


Fig. 5.15. Schematic side view and symbol of the SFT. The choice of cell-selector SFT design
with a single or double acceptor will be determined in this project.

Fig. 5.16. Schematic cross-sectional view and biasing for the single-acceptor (a) and
double-acceptor (b) SFT device.

Fig. 5.17. Optical image of actual single (a) and double (b) acceptor SFT devices.


3.2 Evaluation of SFIFSIS superconductor-ferromagnetic transistor


We report data for three devices, D1-D3, with some of their parameters summarized in
Table 1.

Table 1. SFT Device Parameters. Column headings shown as [illegible] could not be recovered.

Device  Substrate  No. of     [illegible]  Acceptor size  [illegible]  [illegible]  [illegible]  [illegible]
No.                acceptors               (µm x µm)
D1      Al2O3      1          5 x 7.5      5 x 5          30           2            3.1 x 10^-7  3.3 x 10^-7
D2      Si/SiO2    1          10 x 12      8 x 10         45           2            1.3 x 10^-6  5.5 x 10^-7
D3      Al2O3      2          10 x 12      4 x 8          35           2            3.9 x 10^-7  2.5 x 10^-7

3.2.1  Modulation  of  Josephson  current  in  the  single-acceptor  devices 

Similarly to a semiconductor transistor performing the cell selection function in room-
temperature magnetic random access memory (MRAM), for cryogenic RAM applications it is
important to be able to control the Josephson current with an SFT acting as a cell selector.

Quasiparticle injection from the SFIFS junction suppresses the energy gap in the middle (Nb2)
electrode, thus resulting in suppression of the Josephson current in the SIS (acceptor) junction.
We demonstrate this using the single-acceptor SFTs D1 and D2. Fig. 5.18 shows the CVCs of the
acceptor (black curve 1) and injector (blue curve 2) junctions for device D2. Red curve 3 is the
initial portion of the acceptor CVC recorded in an applied magnetic field corresponding to the
second minimum of the Ic vs. H dependence (where Ic is the critical Josephson current). This
curve displays the gap difference feature, which allows us to determine the superconducting
energy gaps of the middle Nb2 and the top Nb3 electrodes to be 0.86 meV and 1.22 meV,
respectively.
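
For a tunnel junction between two superconductors with different gaps, the CVC is expected to show a feature at the gap-difference voltage and a steep current rise at the gap-sum voltage, so reading both voltages off the curve yields the two gaps. The sketch below shows this arithmetic using the gap values quoted above. The feature voltages are computed here for illustration only and are not quoted in the source; all names are illustrative.

```python
def gaps_from_features(v_sum_mv: float, v_diff_mv: float):
    """Return (smaller gap, larger gap) in meV from the gap-sum and gap-difference voltages in mV."""
    return (v_sum_mv - v_diff_mv) / 2.0, (v_sum_mv + v_diff_mv) / 2.0

# Gap values reported in the text (meV). Dividing by e gives voltages in mV
# with the same numerical value.
delta_nb2_mev = 0.86  # middle electrode
delta_nb3_mev = 1.22  # top electrode

v_diff_mv = delta_nb3_mev - delta_nb2_mev  # position of the gap-difference feature
v_sum_mv = delta_nb3_mev + delta_nb2_mev   # position of the gap-sum rise

small_gap, large_gap = gaps_from_features(v_sum_mv, v_diff_mv)
print(f"gap-difference feature at {v_diff_mv:.2f} mV, gap-sum rise at {v_sum_mv:.2f} mV")
print(f"recovered gaps: {small_gap:.2f} meV and {large_gap:.2f} meV")
```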

We measured the Ic vs. H dependence for the SIS (i.e., Nb2/Al/AlOx/Nb3) acceptor junction of
device D2 at different levels of current through the injector SFIFS junction. These data are
shown in Fig. 5.19. Curves from top to bottom are for the injection current, Ii, from 0 to 4 mA
applied with a 0.4 mA increment. The regular shape of the Ic vs. H dependence is preserved up to
high injection current; at Ii = 4.0 mA the dependence is distorted, which may be due to several
reasons: (1) trapping of magnetic flux; (2) development of an inhomogeneous gap state under
quasiparticle injection on the scale of the diffusion length; or (3) transition into the C-state under
the influence of spin injection.

At the same time, we continue testing the already fabricated devices with two acceptors in order to
study the voltage amplification by analogy with our previous work [6]. We expect to achieve
higher voltage amplification in these devices. Our preliminary measurements of these devices
confirmed that they have excellent input-output isolation, similar to the devices with a single
acceptor reported earlier [6, 7].

Fig. 5.18. Current-voltage characteristics (CVCs) for SFT device D2 at 4.2 K. Black
curve 1 is the CVC of the acceptor SIS junction, blue curve 2 is the CVC of the
injector SFIFS junction, and red curve 3 is an initial portion of the acceptor CVC in
an applied magnetic field corresponding to the 2nd minimum of the Ic vs. H dependence.
The gap difference feature is seen in the latter curve.

Fig. 5.19. Ic vs. H dependence for the acceptor junction of device D2 at different levels of the
injection current. Curves from top to bottom are for the injection current from 0 to 4 mA applied
with a 0.4 mA increment.

Fig. 5.20. Maximum Josephson current of the acceptor junction vs. the current through the
injector junction for devices D1 and D2.


Fig. 5.20 shows the maximum Josephson current as a function of the injection current level for
devices D1 and D2. The data demonstrate the possibility of modulating the Josephson current by
quasiparticle injection in SFT devices. Further optimization of the devices in order to achieve
more efficient modulation is underway.

3.2.2  Amplification 

Next, we study voltage amplification in the double-acceptor SFTs, exemplified by device D3. The
experiment, illustrated in Fig. 5.21, was carried out at 4.2 K. In Fig. 5.21, curve 1 is the CVC of the
injector; curve 2 is the CVC of the double acceptor recorded at zero injection current in an applied
magnetic field of 250 Oe; curve 3 is the same CVC but under the influence of the injection current
corresponding to the DC bias point A on curve 1. If, in addition to the DC bias current, a small AC
signal is applied at point A (an image of the oscilloscope screen displaying this signal is shown
at the bottom of the lower inset), then one obtains an output signal (shown at the top of the lower
inset) at the operation point B of the double-acceptor CVC. The vertical (voltage) scale is 20 µV per
division for the input signal and 500 µV per division for the output signal. The horizontal scale is 5
ms/division in all cases. The peak-to-peak amplitude of the input signal is 20 µV, whereas the
peak-to-peak amplitude of the output signal is about 600 µV; therefore, the voltage gain is about
30. In the reverse transmission experiment, the input signal (top curve in the upper inset) was fed
in at the DC bias point D of the double-acceptor CVC, and the output signal (thick line at the bottom
of the upper inset) was acquired at the operation point C of the injector CVC. The voltage scale
is 500 µV per division for the input signal and 5 µV per division for the output signal, which
indicates very good input/output isolation in our SFT device, similar to the devices with a single
acceptor reported earlier [6, 7].
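
The gain figure quoted above follows directly from the two peak-to-peak amplitudes. As a minimal sketch of that arithmetic (the function name `voltage_gain` and the conversion to decibels are illustrative additions, not part of the original measurement procedure):

```python
import math

def voltage_gain(v_in_pp, v_out_pp):
    """Return (linear gain, gain in dB) for peak-to-peak amplitudes in the same units."""
    gain = v_out_pp / v_in_pp
    return gain, 20.0 * math.log10(gain)

# Peak-to-peak amplitudes quoted in the text, in microvolts.
forward_gain, forward_gain_db = voltage_gain(20.0, 600.0)
print(f"voltage gain = {forward_gain:.0f} ({forward_gain_db:.1f} dB)")
```

Because the output amplitude is only approximate ("about 600 µV"), the result should be read as an estimate, consistent with the more conservative "above 25" stated in the caption of Fig. 5.21.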


Fig. 5.21. Demonstration of the voltage amplification experiment on device D3. Curve 1 is the CVC of the
injector; curve 2 is the unperturbed CVC of the double acceptor in an applied magnetic field of 250
Oe; curve 3 is the same CVC but under the influence of the injection current corresponding to the DC bias
point A. If a small AC signal is applied at point A (bottom signal in the lower inset), then one obtains
an output signal (shown at the top of the lower inset) at the operation point B of the acceptor CVC. The
vertical (voltage) scale is 20 µV per division for the input signal and 500 µV per division for the
output signal. (The horizontal scale is 5 ms/division in all cases.) One can infer a voltage gain above 25.
In the reverse transmission experiment, the input signal (top curve in the upper inset) was fed in at the
DC bias point D of the acceptor CVC, and the output signal (thick line in the upper inset) was
acquired at the operation point C of the injector CVC. The voltage scale is 500 µV per division for the
input signal and 5 µV per division for the output signal. This experiment indicates very good
input/output isolation.

Conclusions

During this project period, we made significant progress in two important directions, achieving
results important for the scalable and energy-efficient memory cell implementation that is
critical for the development of dense cryogenic RAM based on superconducting-ferromagnetic
tunnel junctions:

1. Fabricated and successfully tested a new set of SIsFS MJJ devices of smaller size (2 × 2 µm²).
Demonstrated that, with the reduction of the MJJ size, all properties of our memory device are
retained: it can be programmed (Write operation) and read out (Read operation) with a high
IcRn product (200 µV). To our knowledge, this is the only memory element based on a magnetic
Josephson junction with demonstrated Write/Read operation. Similar results have otherwise been
demonstrated only with non-scalable long Josephson junctions (200 µm) with smaller IcRn [8].

To scale the memory element further to sub-µm dimensions, we fabricated and tested
an MJJ with two ferromagnetic layers and demonstrated, in magneto-resistive
measurements, a -2.8% critical current difference between the parallel (P) and anti-parallel
(AP) orientations of the ferromagnetic layers.

2. Fabricated and tested a new set of SISFIFS SFT devices with single- and double-acceptor
configurations. Demonstrated modulation of the acceptor critical current by the current through
the injector, the key property of the SFT for its intended use as a memory cell selector.
Demonstrated a voltage gain above 25 and very good input/output isolation for the SFT.

Conclusion


During  the  project,  we  have  achieved  the  following  results: 

•  We have designed an 8-bit Arithmetic-Logic Unit (ALU) in RSFQ technology. The
design is based on a Kogge-Stone CLA adder and employs a wave-pipeline architecture. The
ALU, the major and critical part of the proposed processor datapath, was fabricated and
fully tested for functionality and operational margins. At low speed, the measured critical
margin for bias current was +/- 7%. Using an on-chip test-bed based on controlled SFQ
relays, we have fully tested the ALU at a 20-GHz clock frequency.

•  We have designed, fabricated, and tested an 8-byte Register File comprising a matrix of
two banks of four 8-bit registers integrated with a control logic block. The complete 8x8-bit
Register File was successfully demonstrated. Besides the data port operation, all 64
memory cells of the register file were tested individually at the nominal bias current. The
operational margins for dc bias current varied from -14% / +25% to -1% / +2%.

•  In an additional effort, we have developed a novel resistor-free biasing scheme for RSFQ
with zero static and minimal dynamic power dissipation. We call it energy-efficient
RSFQ, or ERSFQ. It is fully compatible at the cell level with resistive RSFQ logic, allowing
us to use the RSFQ cell library with minor modifications. Using this approach, we have
designed and successfully demonstrated at high speed (up to 60 GHz) a number of
circuits  including  a  static  frequency  divider  by  2"  ,  a  detector  digital  readout  (ADC),  and 
two types of 8-bit parallel adder. The main achievement in energy dissipation
reduction was the demonstration of an ERSFQ 8-bit parallel adder dissipating 160 aJ per
operation. None of the investigated ERSFQ circuits showed performance degradation
compared with their RSFQ counterparts, and in some cases they even outperformed them.

•  The 8-bit ALU was designed in the new ERSFQ technology. The new ALU architecture is
based on a wave-pipeline ripple-carry adder featuring high throughput (simulated 44 GHz
for the 4.5 kA/cm² process), asynchronous carry propagation, and small latency. At the same
time, it operates with a high data skew factor that should be matched by the register file.

•  Another additional goal of this multi-phase project is to develop and demonstrate an
energy-efficient output data interface between cryogenic 4 K superconducting modules
and room-temperature semiconductor systems using a combination of energy-efficient
on-chip drivers, low-loss, low-dispersion cables, and polarization-modulating
vertical-cavity surface-emitting lasers (PM VCSELs). During this project period, we completed the
fabrication  and  testing  of  the  second  generation  designs  of  PM  VCSELs  with 
modifications  introduced  during  the  previous  project  period.  This  new  design  is  based  on 
a  “half-VCSEL”  structure  with  dielectric  top  distributed  Bragg  reflector  (DBR).  We 
fabricated  the  first  iteration  of  VCSEL  devices  with  dielectric  DBR  and  demonstrated 
improvement  in  performance  although  with  lesser  polarization  control.  The  second 
fabrication iteration to address polarization control is 80% complete. Preliminary testing
of these devices before deposition of the top dielectric DBR mirror showed diode
current-voltage characteristics and clear electroluminescence. In addition, we have
completed, optimized, and employed a cryogenic setup for VCSEL testing over a
wide temperature range. It is based on a Sumitomo two-stage cryocooler with accurate
temperature control of the first stage. Measurement and data collection are
performed using a LabVIEW program developed for these measurements. We tested a set
of VCSEL samples produced by the Univ. of Illinois team to verify and calibrate our
cryogenic setup. We also fabricated and successfully tested a new on-chip energy-efficient
driver based on ERSFQ logic. These drivers are based on dc/SFQ converters redesigned
in ERSFQ logic. The bias of the driver output stage was supplied via the output data
line from the PM VCSEL.

•  Another added task was the development of approaches to maximizing the energy efficiency
of SFQ digital circuits. We performed the first experimental demonstration of the recently
proposed energy-efficient single flux quantum logic with zero static power dissipation,
eSFQ. We also demonstrated that the introduction of passive phase shifters allows
dynamic power dissipation to be reduced by about 20%. Two types of demonstration
eSFQ circuits, shift registers and demultiplexers (deserializers), were implemented using
the standard HYPRES 4.5 kA/cm² fabrication process.

•  The goal of this additional task was to develop 4 K superconducting
ferromagnetic MRAM circuits compatible with energy-efficient Josephson-junction digital
SFQ circuits. A scalable, energy-efficient memory element based on magnetic
Josephson junctions (MJJs) was developed and demonstrated. For the SIsFS MJJ, we
demonstrated two memory states with different critical current
values and a high IcRn comparable to that of conventional SIS Josephson junctions. We
have also demonstrated a superconducting ferromagnetic transistor (SFT), a
three-terminal device with good input/output isolation, suitable for integration with
MJJ-based memory cells and capable of performing the memory cell selector function in
random access memory arrays.

•  We were redesigning the 8-bit ALU in the new ERSFQ technology. To achieve
integration of the full ERSFQ-based datapath (comprising the ALU, the Register File,
and the Instruction Decoder), we developed a new fabrication process featuring 6 wiring
layers (in contrast to the standard 4) and 0.25-µm lithography (in contrast to the previously
employed 1.0 µm). The first wafer with ALU blocks and an 8-bit adder will be available
next month.

•  During this project period, our HYPRES-University of Illinois team worked on the
development of PM VCSELs. Specifically, we analyzed the existing first-generation
designs of PM VCSELs in order to identify the necessary design
modifications; completed the design of a photomask set for the modified PM VCSELs
with a new photonic crystal pattern; completed fabrication of the PM VCSEL samples;
and tested and evaluated the fabricated VCSELs. Although a low threshold voltage
was achieved, polarization control was poor, and we identified layout modifications to
improve it. The new design with a "half-VCSEL" structure (with a
dielectric top distributed Bragg reflector) was completed. The fabrication of new samples
is underway. In addition, we were constructing a cryogenic setup for VCSEL
testing. We modified the HYPRES cryogenic testbed, based on a Sumitomo two-stage
cryocooler, to mount PM VCSEL samples on the first stage of the cryocooler. We
modified the temperature control of the first stage of the cryocooler to enable accurate
temperature setting over the 45 K to 90 K range. We also invented and
designed new on-chip energy-efficient drivers based on ERSFQ logic. These drivers are
redesigned dc/SFQ converters whose output stage is biased via the output data line of the
PM VCSEL. The new ERSFQ dc/SFQ drivers are being fabricated.